04 Jun, 2020

1 commit

  • ext4_mark_inode_dirty() can fail for real reasons. Ignoring its return
    value may lead ext4 to ignore real failures that would result in
    corruption / crashes. Harden ext4_mark_inode_dirty error paths to fail
    as soon as possible and return errors to the caller whenever
    appropriate.

    One scenario in which this bug could bite is inode creation: the
    directory entry for the new inode is added successfully, but
    mark_inode_dirty then returns an error while writing the inode
    itself, and that error is ignored. The result is an inconsistency
    in which the directory entry points to a non-existent inode.

    Ran gce-xfstests smoke tests and verified that there were no
    regressions.

    Signed-off-by: Harshad Shirwadkar
    Link: https://lore.kernel.org/r/20200427013438.219117-1-harshadshirwadkar@gmail.com
    Signed-off-by: Theodore Ts'o

    Harshad Shirwadkar
     

22 Feb, 2020

1 commit

  • If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
    on it, the following warning in ext4_add_complete_io() can be hit:

    WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120

    Here's a minimal reproducer (not 100% reliable) (root isn't required):

    while true; do
    sync
    done &
    while true; do
    rm -f file
    touch file
    chattr -e file
    echo X >> file
    chattr +e file
    done

    The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
    (which only returns true on extent-based files) is checked once to set
    the number of reserved journal credits, and also again later to select
    the flags for ext4_map_blocks() and copy the reserved journal handle to
    ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set,
    the first check can see dioread_nolock disabled while the later one can
    see it enabled, causing the reserved handle to unexpectedly be NULL.

    Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
    related to doing so as well, fix this by synchronizing changing
    EXT4_EXTENTS_FL with ext4_writepages() via the existing
    s_writepages_rwsem (previously called s_journal_flag_rwsem).

    This was originally reported by syzbot without a reproducer at
    https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
    but now that dioread_nolock is the default I also started seeing this
    when running syzkaller locally.

    Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
    Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
    Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io")
    Signed-off-by: Eric Biggers
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@kernel.org

    Eric Biggers
     

06 Nov, 2019

2 commits

  • So far we have reserved only a relatively high, fixed number of revoke
    credits for each transaction. We over-reserved by a large amount in most
    cases, but when freeing large directories or files with data journalling,
    the fixed amount is not enough. In fact, the worst-case estimate for
    freeing a single extent is inconveniently large (maximum extent size).

    We fix this by properly estimating the number of blocks that need
    to be revoked when removing blocks from the inode due to truncate or
    hole punching, and otherwise reserving just a small number of revoke
    credits for each transaction to accommodate freeing of an xattr block or
    so.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-23-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     
  • Provide the ext4_journal_ensure_credits_fn() function to ensure that a
    transaction has a given number of credits and to call a helper function
    to prepare for restarting the transaction. This allows us to remove some
    boilerplate code from various places, adds proper error handling for the
    case where transaction extension or restart fails, and reduces the
    follow-up changes needed for proper revoke record reservation tracking.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-10-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

04 Dec, 2018

1 commit


26 Nov, 2018

1 commit

  • Today, when sb_bread() returns NULL, this can either be because of an
    I/O error or because the system failed to allocate the buffer. Since
    it's an old interface, changing it would require changing many call
    sites.

    So instead we create our own ext4_sb_bread(), which also allows us to
    set the REQ_META flag.

    Also fixed a problem in the xattr code where a NULL return in a
    function could also mean that the xattr was not found, which could
    lead to the wrong error getting returned to userspace.

    Fixes: ac27a0ec112a ("ext4: initial copy of files from ext3")
    Cc: stable@kernel.org # 2.6.19
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

18 Dec, 2017

1 commit

  • A number of ext4 source files were skipped because their copyright
    permission statements didn't match the expected text used by the
    automated conversion utilities. I've added SPDX tags for the rest.

    While looking at some of these files, I've noticed that we have quite
    a bit of variation on the licenses that were used --- in particular
    some of the Red Hat licenses on the jbd2 files use a GPL2+ license,
    and we have some files that have a LGPL-2.1 license (which was quite
    surprising).

    I've not attempted to do any license changes. Even if it is perfectly
    legal to relicense to GPL 2.0-only for consistency's sake, that should
    be done with ext4 developer community discussion.

    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

22 Jun, 2017

1 commit

  • We don't need acls on xattr inodes because they are not directly
    accessible from user mode.

    Besides, lockdep complains about recursive locking of xattr_sem, as
    seen below.

    =============================================
    [ INFO: possible recursive locking detected ]
    4.11.0-rc8+ #402 Not tainted
    ---------------------------------------------
    python/1894 is trying to acquire lock:
    (&ei->xattr_sem){++++..}, at: [] ext4_xattr_get+0x66/0x270

    but task is already holding lock:
    (&ei->xattr_sem){++++..}, at: [] ext4_xattr_set_handle+0xa0/0x5d0

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(&ei->xattr_sem);
    lock(&ei->xattr_sem);

    *** DEADLOCK ***

    May be due to missing lock nesting notation

    3 locks held by python/1894:
    #0: (sb_writers#10){.+.+.+}, at: [] mnt_want_write+0x1f/0x50
    #1: (&sb->s_type->i_mutex_key#15){+.+...}, at: [] vfs_setxattr+0x57/0xb0
    #2: (&ei->xattr_sem){++++..}, at: [] ext4_xattr_set_handle+0xa0/0x5d0

    stack backtrace:
    CPU: 0 PID: 1894 Comm: python Not tainted 4.11.0-rc8+ #402
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    dump_stack+0x67/0x99
    __lock_acquire+0x5f3/0x1830
    lock_acquire+0xb5/0x1d0
    down_read+0x2f/0x60
    ext4_xattr_get+0x66/0x270
    ext4_get_acl+0x43/0x1e0
    get_acl+0x72/0xf0
    posix_acl_create+0x5e/0x170
    ext4_init_acl+0x21/0xc0
    __ext4_new_inode+0xffd/0x16b0
    ext4_xattr_set_entry+0x5ea/0xb70
    ext4_xattr_block_set+0x1b5/0x970
    ext4_xattr_set_handle+0x351/0x5d0
    ext4_xattr_set+0x124/0x180
    ext4_xattr_user_set+0x34/0x40
    __vfs_setxattr+0x66/0x80
    __vfs_setxattr_noperm+0x69/0x1c0
    vfs_setxattr+0xa2/0xb0
    setxattr+0x129/0x160
    path_setxattr+0x87/0xb0
    SyS_setxattr+0xf/0x20
    entry_SYSCALL_64_fastpath+0x18/0xad

    Signed-off-by: Tahsin Erdogan
    Signed-off-by: Theodore Ts'o

    Tahsin Erdogan
     

10 Mar, 2016

1 commit


18 Oct, 2015

1 commit


04 Jul, 2015

2 commits

  • Currently ext4_ind_migrate() doesn't correctly handle a file which
    contains a hole at the beginning of the file. This causes the migration
    to be done incorrectly, and a subsequent delayed allocation write to
    the "hole" then claims the same data blocks again, resulting in fs
    corruption.

    # assuming 4k block size ext4, with delalloc enabled
    # skip the first block and write to the second block
    xfs_io -fc "pwrite 4k 4k" -c "fsync" /mnt/ext4/testfile

    # converting to indirect-mapped file, which would move the data blocks
    # to the beginning of the file, but extent status cache still marks
    # that region as a hole
    chattr -e /mnt/ext4/testfile

    # delayed allocation writes to the "hole", reclaim the same data block
    # again, results in i_blocks corruption
    xfs_io -c "pwrite 0 4k" /mnt/ext4/testfile
    umount /mnt/ext4
    e2fsck -nf /dev/sda6
    ...
    Inode 53, i_blocks is 16, should be 8. Fix? no
    ...

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Eryu Guan
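
    The e2fsck figures in the entry above are consistent with i_blocks being
    counted in 512-byte units, so that one 4k block contributes 8. A minimal
    sketch of that arithmetic, assuming the 4k block size from the reproducer
    and assuming the single data block ends up accounted exactly twice; the
    variable names are illustrative and are not ext4 identifiers:

```sh
# Illustrative arithmetic only; names are hypothetical, not ext4 identifiers.
block_size=4096                      # 4k filesystem block, as in the reproducer
unit=512                             # i_blocks is counted in 512-byte units
per_block=$(( block_size / unit ))   # i_blocks contribution of one data block

expected=$(( 1 * per_block ))        # one data block really belongs to the file
observed=$(( 2 * per_block ))        # assumption: the same block is accounted twice

echo "i_blocks is $observed, should be $expected"
```

    With these assumptions the script prints the same pair of values that
    e2fsck reports for inode 53.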
     
  • Currently the checks in ext4_ind_migrate() before doing the real
    conversion are not sufficient:

    a) delayed allocated extents could bypass the check on eh->eh_entries
    and eh->eh_depth

    This can be demonstrated by this script

    xfs_io -fc "pwrite 0 4k" -c "pwrite 8k 4k" /mnt/ext4/testfile
    chattr -e /mnt/ext4/testfile

    where testfile has two extents but is still converted to the
    non-extent-based file format.

    b) only the extent length is checked, not the offset, which results
    in data loss (delalloc) or fs corruption (nodelalloc), because a
    non-extent-based file supports at most (12 + 2^10 + 2^20 + 2^30)
    blocks

    This can be demonstrated by

    xfs_io -fc "pwrite 5T 4k" /mnt/ext4/testfile
    chattr -e /mnt/ext4/testfile
    sync

    If delalloc is enabled, dmesg prints
    EXT4-fs warning (device dm-4): ext4_block_to_path:105: block 1342177280 > max in inode 53
    EXT4-fs (dm-4): Delayed block allocation failed for inode 53 at logical offset 1342177280 with max blocks 1 with error 5
    EXT4-fs (dm-4): This should not happen!! Data will be lost

    If delalloc is disabled, e2fsck -nf shows corruption
    Inode 53, i_size is 5497558142976, should be 4096. Fix? no

    Fix the two issues by

    a) forcing all delayed allocation blocks to be allocated before checking
    eh->eh_depth and eh->eh_entries
    b) requiring that the last logical block of the extent lies within the
    direct map

    Signed-off-by: Eryu Guan
    Signed-off-by: Theodore Ts'o
    Cc: stable@vger.kernel.org

    Eryu Guan
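
    The numbers quoted in the entry above can be checked by hand. The sketch
    below assumes a 4k block size with 4-byte block pointers (which gives the
    2^10 pointers per indirect block used in the message) and assumes that
    the "5T" offset is 5 * 2^40 bytes; the variable names are illustrative
    only:

```sh
# Illustrative arithmetic only; names are hypothetical, not ext4 identifiers.
block_size=4096
ptrs=$(( block_size / 4 ))           # 1024 block pointers per indirect block

# 12 direct blocks + single, double and triple indirect ranges
max_blocks=$(( 12 + ptrs + ptrs * ptrs + ptrs * ptrs * ptrs ))

offset_bytes=$(( 5 * 1024 * 1024 * 1024 * 1024 ))   # "pwrite 5T 4k"
offset_block=$(( offset_bytes / block_size ))       # logical block of the write
i_size=$(( offset_bytes + block_size ))             # file size after the write

echo "max blocks for a non-extent file: $max_blocks"
echo "logical block written:            $offset_block"
echo "i_size after the write:           $i_size"

if [ "$offset_block" -ge "$max_blocks" ]; then
    echo "offset is beyond what a non-extent-based file can map"
fi
```

    Under these assumptions the logical block and i_size match the values in
    the dmesg and e2fsck output quoted above, and the logical block exceeds
    the limit for a non-extent-based file.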
     

16 Apr, 2015

1 commit


26 Nov, 2014

1 commit


02 Sep, 2014

3 commits


28 Jul, 2014

1 commit


13 May, 2014

1 commit


29 Aug, 2013

1 commit


17 Aug, 2013

1 commit

  • When we read in an extent tree leaf block from disk, arrange to have
    all of its entries cached. In nearly all cases the in-memory
    representation will be more compact than the on-disk representation in
    the buffer cache, and it allows us to get the information without
    having to traverse the extent tree for successive extents.

    Signed-off-by: "Theodore Ts'o"
    Reviewed-by: Zheng Liu

    Theodore Ts'o
     

11 Apr, 2013

2 commits

  • With the bigalloc feature enabled we do not support indirect addressing
    at all, so in this case we have to prevent conversion from extent
    addressing to indirect addressing. The problem was introduced by the commit
    "ext4: support simple conversion of extent-mapped inodes to use i_blocks"

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     
  • Move ext4_ind_migrate() into migrate.c file since it makes much more
    sense and ext4_ext_migrate() is there as well.

    Also fix tiny style problem - add spaces around "=" in "i=0".

    Signed-off-by: Lukas Czerner
    Signed-off-by: "Theodore Ts'o"

    Lukas Czerner
     

10 Feb, 2013

1 commit


09 Feb, 2013

1 commit

  • So we can better understand what bits of ext4 are responsible for
    long-running jbd2 handles, use jbd2__journal_start() so we can pass
    context information for logging purposes.

    The recommended way for finding the longer-running handles is:

    T=/sys/kernel/debug/tracing
    EVENT=$T/events/jbd2/jbd2_handle_stats
    echo "interval > 5" > $EVENT/filter
    echo 1 > $EVENT/enable

    ./run-my-fs-benchmark

    cat $T/trace > /tmp/problem-handles

    This will list handles that were active for longer than 20ms. Having
    longer-running handles is bad, because a commit started at the wrong
    time could stall for those 20+ milliseconds, which could delay an
    fsync() or an O_SYNC operation. Here is an example line from the
    trace file describing a handle which lived on for 311 jiffies, or over
    1.2 seconds:

    postmark-2917 [000] .... 196.435786: jbd2_handle_stats: dev 254,32
    tid 570 type 2 line_no 2541 interval 311 sync 0 requested_blocks 1
    dirtied_blocks 0

    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
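
    The "interval" field in the entry above is in jiffies, so the figures
    only line up for a particular tick rate. A minimal sketch of the
    conversion, assuming HZ=250 (an assumption inferred from the 20ms and
    1.2 second figures, not stated in the message); jiffies_to_ms here is a
    hypothetical shell helper, not the kernel function:

```sh
# Illustrative conversion only; HZ=250 is an assumption, not from the message.
hz=250

# Hypothetical helper: convert a jiffies count to milliseconds at $hz.
jiffies_to_ms() {
    echo $(( $1 * 1000 / hz ))
}

threshold_ms=$(jiffies_to_ms 5)      # the "interval > 5" filter
example_ms=$(jiffies_to_ms 311)      # the postmark handle in the trace line

echo "filter threshold: ${threshold_ms} ms"
echo "example handle:   ${example_ms} ms"
```

    On a kernel built with a different HZ, the filter value would need to be
    scaled to keep the same threshold in milliseconds.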
     

29 Nov, 2012

1 commit

  • Previously, ext4_extents.h was being included at the end of ext4.h,
    which was bad for a number of reasons: (a) it was not being included
    in the expected place, and (b) it caused the header to be included
    multiple times. There were #ifdef's to prevent this from causing any
    problems, but it still was unnecessary.

    By moving the function declarations that were in ext4_extents.h to
    ext4.h, which is standard practice for where the function declarations
    for the rest of ext4.h can be found, we can remove ext4_extents.h from
    being included in ext4.h at all, and then we can only include
    ext4_extents.h where it is needed in ext4's source files.

    It should be possible to move a few more things into ext4.h, and
    further reduce the number of source files that need to #include
    ext4_extents.h, but that's a cleanup for another day.

    Reported-by: Sachin Kamat
    Reported-by: Wei Yongjun
    Signed-off-by: "Theodore Ts'o"

    Theodore Ts'o
     

16 May, 2012

1 commit


21 Feb, 2012

1 commit


09 Jan, 2012

1 commit

  • Delete any instances of include module.h that were not strictly
    required. In the case of ext2, the declaration of MODULE_LICENSE
    etc. were in inode.c but the module_init/exit were in super.c, so
    relocate the MODULE_LICENSE/AUTHOR block to super.c which makes it
    consistent with ext3 and ext4 at the same time.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Jan Kara

    Paul Gortmaker
     

03 Nov, 2011

1 commit

  • * 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/hch/vfs-queue:
    vfs: add d_prune dentry operation
    vfs: protect i_nlink
    filesystems: add set_nlink()
    filesystems: add missing nlink wrappers
    logfs: remove unnecessary nlink setting
    ocfs2: remove unnecessary nlink setting
    jfs: remove unnecessary nlink setting
    hypfs: remove unnecessary nlink setting
    vfs: ignore error on forced remount
    readlinkat: ensure we return ENOENT for the empty pathname for normal lookups
    vfs: fix dentry leak in simple_fill_super()

    Linus Torvalds
     

02 Nov, 2011

1 commit


29 Oct, 2011

2 commits

  • The tmp_inode should have the same uid/gid as the original inode.
    Otherwise new metadata blocks will be accounted to the wrong quota-id,
    which will result in a quota leak after the inode migration is
    completed.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     
  • This patch cleans up the code a bit; the actual logic is unchanged.
    - Move the current block pointer into migrate_structure, so that all
    walk info is in one structure.
    - Get rid of useless null ind-block ptr checks; the caller already
    does that check.

    Signed-off-by: Dmitry Monakhov
    Signed-off-by: "Theodore Ts'o"

    Dmitry Monakhov
     

10 Sep, 2011

1 commit


03 May, 2011

1 commit


31 Mar, 2011

1 commit


22 Feb, 2011

1 commit


11 Jan, 2011

1 commit


28 Oct, 2010

1 commit


14 Jun, 2010

1 commit