27 Oct, 2020

1 commit


23 Oct, 2020

2 commits

  • …cm/fs/fscrypt/fscrypt") into android-mainline

    Steps on the way to 5.10-rc1

    Change-Id: Ifceecc1b9f94ea893484002c69aeb7b82d246f64
    Signed-off-by: Greg Kroah-Hartman <gregkh@google.com>

    Greg Kroah-Hartman
     
  • Pull ext4 updates from Ted Ts'o:
    "The significant new ext4 feature this time around is Harshad's new
    fast_commit mode.

    In addition, thanks to Mauricio for fixing a race where mmap'ed pages
    that are being changed in parallel with a data=journal transaction
    commit could result in bad checksums that could cause journal replays
    to fail.

    Also notable is Ritesh's buffered write optimization which can result
    in significant improvements on parallel write workloads. (The kernel
    test robot reported a 330.6% improvement on fio.write_iops on a 96
    core system using DAX)

    Besides that, we have the usual miscellaneous cleanups and bug fixes"

    Link: https://lore.kernel.org/r/20200925071217.GO28663@shao2-debian

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (46 commits)
    ext4: fix invalid inode checksum
    ext4: add fast commit stats in procfs
    ext4: add a mount opt to forcefully turn fast commits on
    ext4: fast commit recovery path
    jbd2: fast commit recovery path
    ext4: main fast-commit commit path
    jbd2: add fast commit machinery
    ext4 / jbd2: add fast commit initialization
    ext4: add fast_commit feature and handling for extended mount options
    doc: update ext4 and journalling docs to include fast commit feature
    ext4: Detect already used quota file early
    jbd2: avoid transaction reuse after reformatting
    ext4: use the normal helper to get the actual inode
    ext4: fix bs < ps issue reported with dioread_nolock mount opt
    ext4: data=journal: write-protect pages on j_submit_inode_data_buffers()
    ext4: data=journal: fixes for ext4_page_mkwrite()
    jbd2, ext4, ocfs2: introduce/use journal callbacks j_submit|finish_inode_data_buffers()
    jbd2: introduce/export functions jbd2_journal_submit|finish_inode_data_buffers()
    ext4: introduce ext4_sb_bread_unmovable() to replace sb_bread_unmovable()
    ext4: use ext4_sb_bread() instead of sb_bread()
    ...

    Linus Torvalds
     

22 Oct, 2020

1 commit

  • This patch adds fast commit recovery path support for the Ext4 file
    system. We add several helper functions that are similar in spirit to
    the e2fsprogs journal recovery path handlers. Examples of such functions
    include a simple block allocator and an idempotent block bitmap update
    function. Using these routines and the fast commit log in the fast
    commit area, the recovery path (ext4_fc_replay()) performs fast commit
    log recovery.

    Reported-by: kernel test robot
    Signed-off-by: Harshad Shirwadkar
    Link: https://lore.kernel.org/r/20201015203802.3597742-8-harshadshirwadkar@gmail.com
    Signed-off-by: Theodore Ts'o

    Harshad Shirwadkar
     

18 Oct, 2020

2 commits

  • Remove all open-coded reads of metadata buffers and switch to the
    ext4_read_bh_*() common helpers.

    Signed-off-by: zhangyi (F)
    Suggested-by: Jan Kara
    Link: https://lore.kernel.org/r/20200924073337.861472-4-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o

    zhangyi (F)
     
  • The metadata buffer is no longer trusted after we read it from disk
    again because it was not uptodate for some reason (e.g. a failed
    write-back). Otherwise we may hit the memory corruption problem below in
    ext4_ext_split()->memset() if we read stale data from a newly
    allocated extent block on disk whose async write-out failed, but skip
    verifying it again because the verified bit has already been set
    on the buffer.

    [ 29.774674] BUG: unable to handle kernel paging request at ffff88841949d000
    ...
    [ 29.783317] Oops: 0002 [#2] SMP
    [ 29.784219] R10: 00000000000f4240 R11: 0000000000002e28 R12: ffff88842fa1c800
    [ 29.784627] CPU: 1 PID: 126 Comm: kworker/u4:3 Tainted: G D W
    [ 29.785546] R13: ffffffff9cddcc20 R14: ffffffff9cddd420 R15: ffff88842fa1c2f8
    [ 29.786679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),BIOS ?-20190727_0738364
    [ 29.787588] FS: 0000000000000000(0000) GS:ffff88842fa00000(0000) knlGS:0000000000000000
    [ 29.789288] Workqueue: writeback wb_workfn
    [ 29.790319] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 29.790321] (flush-8:0)
    [ 29.790844] CR2: 0000000000000008 CR3: 00000004234f2000 CR4: 00000000000006f0
    [ 29.791924] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 29.792839] RIP: 0010:__memset+0x24/0x30
    [ 29.793739] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 29.794256] Code: 90 90 90 90 90 90 0f 1f 44 00 00 49 89 f9 48 89 d1 83 e2 07 48 c1 e9 033
    [ 29.795161] Kernel panic - not syncing: Fatal exception in interrupt
    ...
    [ 29.808149] Call Trace:
    [ 29.808475] ext4_ext_insert_extent+0x102e/0x1be0
    [ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0
    [ 29.809652] ext4_map_blocks+0x290/0x8a0
    [ 29.809085] ext4_ext_map_blocks+0xa89/0x1bb0
    [ 29.809652] ext4_map_blocks+0x290/0x8a0
    [ 29.810161] ext4_writepages+0xc85/0x17c0
    ...

    Fix this by clearing the buffer's verified bit if we read the metadata
    block from disk again.
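
A minimal userspace sketch of the idea, not the kernel code. `struct toy_buffer` and the `toy_*` helpers are hypothetical names standing in for the buffer state bits discussed above; the sketch only shows why a re-read has to drop the verified bit so that the contents get checked again.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the two buffer state bits discussed above. */
struct toy_buffer {
    bool uptodate;  /* contents match what is on disk */
    bool verified;  /* contents passed the metadata checks */
};

/* Re-reading from disk replaces the contents, so any earlier
 * verification result no longer applies and must be dropped. */
static void toy_read_from_disk(struct toy_buffer *b)
{
    b->verified = false;
    b->uptodate = true;
}

/* Mark the buffer as checked (the check itself is elided in this sketch). */
static void toy_mark_verified(struct toy_buffer *b)
{
    b->verified = true;
}

/* A caller must run the checks whenever the verified bit is clear. */
static bool toy_needs_verify(const struct toy_buffer *b)
{
    return b->uptodate && !b->verified;
}
```

Without the `verified = false` line in `toy_read_from_disk()`, a buffer that had been verified once would skip the checks after a re-read, which is the situation the commit describes.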

    Signed-off-by: zhangyi (F)
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20200924073337.861472-2-yi.zhang@huawei.com
    Signed-off-by: Theodore Ts'o

    zhangyi (F)
     

22 Sep, 2020

2 commits

  • Convert ext4 to use the new functions fscrypt_prepare_new_inode() and
    fscrypt_set_context(). This avoids calling
    fscrypt_get_encryption_info() from within a transaction, which can
    deadlock because fscrypt_get_encryption_info() isn't GFP_NOFS-safe.

    For more details about this problem, see the earlier patch
    "fscrypt: add fscrypt_prepare_new_inode() and fscrypt_set_context()".

    Link: https://lore.kernel.org/r/20200917041136.178600-4-ebiggers@kernel.org
    Signed-off-by: Eric Biggers

    Eric Biggers
     
  • To compute a new inode's xattr credits, we need to know whether the
    inode will be encrypted or not. When we switch to use the new helper
    function fscrypt_prepare_new_inode(), we won't find out whether the
    inode will be encrypted until slightly later than is currently the case.
    That will require moving the code block that computes the xattr credits.

    To make this easier and reduce the length of __ext4_new_inode(), move
    this code block into a new function ext4_xattr_credits_for_new_inode().

    Link: https://lore.kernel.org/r/20200917041136.178600-3-ebiggers@kernel.org
    Signed-off-by: Eric Biggers

    Eric Biggers
     

27 Jun, 2020

1 commit


26 Jun, 2020

1 commit

  • This adds support for encryption with casefolding.

    Since the name on disk is case preserving, and also encrypted, we can no
    longer just recompute the hash on the fly. Additionally, to avoid
    leaking extra information from the hash of the unencrypted name, we use
    siphash via an fscrypt v2 policy.

    The hash is stored at the end of the directory entry for all entries
    inside of an encrypted and casefolded directory apart from those that
    deal with '.' and '..'. This way, the change is backwards compatible
    with existing ext4 filesystems.

    Signed-off-by: Daniel Rosenberg
    Signed-off-by: Paul Lawrence
    Test: Boots, /data/media is case insensitive
    Bug: 138322712
    Change-Id: I07354e3129aa07d309fbe36c002fee1af718f348

    Daniel Rosenberg
     

16 Jun, 2020

1 commit

  • Pull more ext4 updates from Ted Ts'o:
    "This is the second round of ext4 commits for 5.8 merge window [1].

    It includes the per-inode DAX support, which was dependent on the DAX
    infrastructure which came in via the XFS tree, and a number of
    regression and bug fixes; most notably the "BUG: using
    smp_processor_id() in preemptible code in ext4_mb_new_blocks" reported
    by syzkaller"

    [1] The pull request actually came in 15 minutes after I had tagged the
    rc1 release. Tssk, tssk, late.. - Linus

    * tag 'ext4-for-linus-5.8-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4, jbd2: ensure panic by fix a race between jbd2 abort and ext4 error handlers
    ext4: support xattr gnu.* namespace for the Hurd
    ext4: mballoc: Use this_cpu_read instead of this_cpu_ptr
    ext4: avoid utf8_strncasecmp() with unstable name
    ext4: stop overwrite the errcode in ext4_setup_super
    ext4: fix partial cluster initialization when splitting extent
    ext4: avoid race conditions when remounting with options that change dax
    Documentation/dax: Update DAX enablement for ext4
    fs/ext4: Introduce DAX inode flag
    fs/ext4: Remove jflag variable
    fs/ext4: Make DAX mount option a tri-state
    fs/ext4: Only change S_DAX on inode load
    fs/ext4: Update ext4_should_use_dax()
    fs/ext4: Change EXT4_MOUNT_DAX to EXT4_MOUNT_DAX_ALWAYS
    fs/ext4: Disallow verity if inode is DAX
    fs/ext4: Narrow scope of DAX check in setflags

    Linus Torvalds
     

11 Jun, 2020

1 commit


06 Jun, 2020

1 commit

  • Pull ext4 updates from Ted Ts'o:
    "A lot of bug fixes and cleanups for ext4, including:

    - Fix performance problems found in dioread_nolock now that it is the
    default, caused by transaction leaks.

    - Clean up fiemap handling in ext4

    - Clean up and refactor multiple block allocator (mballoc) code

    - Fix a problem with mballoc with smaller file systems running out
    of blocks because they couldn't properly use blocks that had been
    reserved by inode preallocation.

    - Fixed a race in ext4_sync_parent() versus rename()

    - Simplify the error handling in the extent manipulation code

    - Make sure all metadata I/O errors are reflected to
    ext4_ext_dirty()'s and ext4_make_inode_dirty()'s callers.

    - Avoid passing an error pointer to brelse in ext4_xattr_set()

    - Fix race which could result in freeing an inode on the dirty list
    in data=journal mode.

    - Fix refcount handling if ext4_iget() fails

    - Fix a crash in generic/019 caused by a corrupted extent node"

    * tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (58 commits)
    ext4: avoid unnecessary transaction starts during writeback
    ext4: don't block for O_DIRECT if IOCB_NOWAIT is set
    ext4: remove the access_ok() check in ext4_ioctl_get_es_cache
    fs: remove the access_ok() check in ioctl_fiemap
    fs: handle FIEMAP_FLAG_SYNC in fiemap_prep
    fs: move fiemap range validation into the file systems instances
    iomap: fix the iomap_fiemap prototype
    fs: move the fiemap definitions out of fs.h
    fs: mark __generic_block_fiemap static
    ext4: remove the call to fiemap_check_flags in ext4_fiemap
    ext4: split _ext4_fiemap
    ext4: fix fiemap size checks for bitmap files
    ext4: fix EXT4_MAX_LOGICAL_BLOCK macro
    add comment for ext4_dir_entry_2 file_type member
    jbd2: avoid leaking transaction credits when unreserving handle
    ext4: drop ext4_journal_free_reserved()
    ext4: mballoc: use lock for checking free blocks while retrying
    ext4: mballoc: refactor ext4_mb_good_group()
    ext4: mballoc: introduce pcpu seqcnt for freeing PA to improve ENOSPC handling
    ext4: mballoc: refactor ext4_mb_discard_preallocations()
    ...

    Linus Torvalds
     

04 Jun, 2020

1 commit

  • ext4_orphan_get() invokes ext4_read_inode_bitmap(), which returns a
    reference of the specified buffer_head object to "bitmap_bh" with
    increased refcnt.

    When ext4_orphan_get() returns, local variable "bitmap_bh" becomes
    invalid, so the refcount should be decreased to keep refcount balanced.

    The reference counting issue happens in one exception handling path of
    ext4_orphan_get(). When ext4_iget() fails, the function forgets to
    decrease the refcnt increased by ext4_read_inode_bitmap(), causing a
    refcnt leak.

    Fix this issue by calling brelse() when ext4_iget() fails.
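
The pattern can be sketched in plain userspace C. Everything here is hypothetical (`struct toy_bh`, `toy_read_bitmap()`, `toy_brelse()`, `toy_orphan_get()`); it only illustrates that every exit path, including the failure path, has to drop the reference taken by the read helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted buffer; not the kernel's struct buffer_head. */
struct toy_bh {
    int refcount;
};

/* Like a read helper that returns the buffer with an extra reference. */
static struct toy_bh *toy_read_bitmap(struct toy_bh *bh)
{
    bh->refcount++;
    return bh;
}

/* Modeled on brelse(): drop one reference; a NULL pointer is ignored. */
static void toy_brelse(struct toy_bh *bh)
{
    if (bh)
        bh->refcount--;
}

/* Returns 0 on success, -1 if the (simulated) inode lookup fails.
 * Every exit path releases the reference taken at the top. */
static int toy_orphan_get(struct toy_bh *bitmap, int iget_fails)
{
    struct toy_bh *bitmap_bh = toy_read_bitmap(bitmap);

    if (iget_fails) {
        toy_brelse(bitmap_bh);  /* the release the buggy path was missing */
        return -1;
    }
    toy_brelse(bitmap_bh);
    return 0;
}
```

After either outcome the caller's reference count is unchanged, which is the balance the fix restores.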

    Signed-off-by: Xiyu Yang
    Signed-off-by: Xin Tan
    Cc: stable@kernel.org
    Link: https://lore.kernel.org/r/1587618568-13418-1-git-send-email-xiyuyang19@fudan.edu.cn
    Signed-off-by: Theodore Ts'o

    Xiyu Yang
     

29 May, 2020

1 commit

  • To prevent complications with in memory inodes we only set S_DAX on
    inode load. FS_XFLAG_DAX can be changed at any time and S_DAX will
    change after inode eviction and reload.

    Add init bool to ext4_set_inode_flags() to indicate if the inode is
    being newly initialized.

    Assert that S_DAX is not set on an inode which is just being loaded.

    Reviewed-by: Jan Kara
    Signed-off-by: Ira Weiny

    Link: https://lore.kernel.org/r/20200528150003.828793-6-ira.weiny@intel.com
    Signed-off-by: Theodore Ts'o

    Ira Weiny
     

22 May, 2020

1 commit


16 Apr, 2020

2 commits

  • Current wait times have proven to be too short to protect against inode
    reuses that lead to metadata inconsistencies.

    Now that we will retry the inode allocation if we can't find any
    recently deleted inodes, it's a lot safer to increase the recently
    deleted time from 5 seconds to a minute.

    Link: https://lore.kernel.org/r/20200414023925.273867-1-tytso@mit.edu
    Google-Bug-Id: 36602237
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     
  • The documentation comments for ext4_read_block_bitmap_nowait and
    ext4_read_inode_bitmap describe them as returning NULL on error, but
    they return an ERR_PTR on error; update the documentation to match.

    The documentation comment for ext4_wait_block_bitmap describes it as
    returning 1 on error, but it returns -errno on error; update the
    documentation to match.
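
Why the distinction matters can be shown with a userspace imitation of the error-pointer convention. The `toy_*` functions below are assumptions written for this sketch and are not the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() macros themselves; the point is that an encoded error is not NULL, so a caller that only checks for NULL would treat it as a valid buffer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO 4095

/* Encode a small negative errno in the topmost part of the pointer range. */
static void *toy_err_ptr(long err)
{
    return (void *)(intptr_t)err;
}

/* True if the pointer value falls in the range reserved for errors. */
static int toy_is_err(const void *p)
{
    return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

/* Recover the negative errno from an error pointer. */
static long toy_ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Hypothetical reader: returns a valid pointer or an encoded -EIO, never NULL. */
static void *toy_read_bitmap(int fail)
{
    static char bitmap[8];

    return fail ? toy_err_ptr(-EIO) : (void *)bitmap;
}
```

A caller would test the result with `toy_is_err()` and then extract the code with `toy_ptr_err()`, rather than comparing against NULL.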

    Signed-off-by: Josh Triplett
    Reviewed-by: Ritesh Harjani
    Link: https://lore.kernel.org/r/60a3f4996f4932c45515aaa6b75ca42f2a78ec9b.1585512514.git.josh@joshtriplett.org
    Signed-off-by: Theodore Ts'o

    Josh Triplett
     

02 Apr, 2020

1 commit

  • Using a separate function, ext4_set_errno() to set the errno is
    problematic because it doesn't do the right thing once
    s_last_error_errorcode is non-zero. It's also less racy to set all of
    the error information all at once. (Also, as a bonus, it shrinks code
    size slightly.)

    Link: https://lore.kernel.org/r/20200329020404.686965-1-tytso@mit.edu
    Fixes: 878520ac45f9 ("ext4: save the error code which triggered...")
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

26 Mar, 2020

1 commit

  • When ext4 is running on a filesystem without a journal, it tries not to
    reuse recently deleted inodes to provide better chances for filesystem
    recovery in case of a crash. However, this logic forbids reuse of freed
    inodes for up to 5 minutes and, especially for filesystems with a small
    number of inodes, can lead to ENOSPC errors when allocating new
    inodes.

    Fix the problem by allowing reuse of a recently deleted inode if no
    other inode is free in the scanned range.
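
A small sketch of the fallback described above, under simplifying assumptions: `struct toy_slot` and `toy_find_free_inode()` are hypothetical, and the real allocator works on the inode bitmap and deletion times rather than on an array of flags.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NO_INODE (-1)

/* Hypothetical per-slot state for one scanned range. */
struct toy_slot {
    bool in_use;
    bool recently_deleted;
};

/* Prefer a free slot that was not recently deleted. If the scanned range
 * has none, fall back to the first recently deleted free slot instead of
 * reporting that no inode is available. */
static int toy_find_free_inode(const struct toy_slot *slots, int n)
{
    int fallback = TOY_NO_INODE;

    for (int i = 0; i < n; i++) {
        if (slots[i].in_use)
            continue;
        if (!slots[i].recently_deleted)
            return i;
        if (fallback == TOY_NO_INODE)
            fallback = i;
    }
    return fallback;
}
```

The function only returns `TOY_NO_INODE` when every slot in the range is in use, which mirrors the goal of avoiding spurious ENOSPC results.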

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20200318121317.31941-1-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

22 Feb, 2020

1 commit

  • During an online resize an array of s_flex_groups structures gets replaced
    so it can get enlarged. If there is a concurrent access to the array and
    this memory has been reused then this can lead to an invalid memory access.

    The s_flex_group array has been converted into an array of pointers rather
    than an array of structures. This is to ensure that the information
    contained in the structures cannot get out of sync during a resize due to
    an accessor updating the value in the old structure after it has been
    copied but before the array pointer is updated. Since the structures
    themselves are no longer copied, but only the pointers to them, this case
    is mitigated.
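
The effect of switching to an array of pointers can be illustrated with a single-threaded userspace sketch. `struct toy_flex_group` and `toy_grow()` are hypothetical, and the sketch does not model the concurrency or how the old array is eventually freed; it only shows that an update made through a pointer obtained before the resize is still visible afterwards.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counter block standing in for one flex group's data. */
struct toy_flex_group {
    long free_clusters;
};

/* Grow an array of pointers from old_n to new_n entries. Only the pointers
 * are copied; each existing structure stays at its original address.
 * Returns NULL on allocation failure (the old array is left untouched). */
static struct toy_flex_group **toy_grow(struct toy_flex_group **old,
                                        size_t old_n, size_t new_n)
{
    struct toy_flex_group **arr = calloc(new_n, sizeof(*arr));

    if (!arr)
        return NULL;
    if (old_n)
        memcpy(arr, old, old_n * sizeof(*arr));
    for (size_t i = old_n; i < new_n; i++) {
        arr[i] = calloc(1, sizeof(**arr));
        if (!arr[i]) {
            /* Undo only the structures allocated by this call. */
            while (i > old_n)
                free(arr[--i]);
            free(arr);
            return NULL;
        }
    }
    return arr;
}
```

With an array of structures, the same late update would land in the old copy and be lost; here both arrays point at the same structure.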

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
    Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
    Signed-off-by: Suraj Jitindar Singh
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    Suraj Jitindar Singh
     

27 Dec, 2019

2 commits


15 Dec, 2019

1 commit

  • It's possible that __ext4_new_inode will release the xattr block, which
    triggers a warning because the revoke credits will be 0 if
    handle == NULL. The scripts below can reproduce it easily.

    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 3861 at fs/jbd2/revoke.c:374 jbd2_journal_revoke+0x30e/0x540 fs/jbd2/revoke.c:374
    ...
    __ext4_forget+0x1d7/0x800 fs/ext4/ext4_jbd2.c:248
    ext4_free_blocks+0x213/0x1d60 fs/ext4/mballoc.c:4743
    ext4_xattr_release_block+0x55b/0x780 fs/ext4/xattr.c:1254
    ext4_xattr_block_set+0x1c2c/0x2c40 fs/ext4/xattr.c:2112
    ext4_xattr_set_handle+0xa7e/0x1090 fs/ext4/xattr.c:2384
    __ext4_set_acl+0x54d/0x6c0 fs/ext4/acl.c:214
    ext4_init_acl+0x218/0x2e0 fs/ext4/acl.c:293
    __ext4_new_inode+0x352a/0x42b0 fs/ext4/ialloc.c:1151
    ext4_mkdir+0x2e9/0xbd0 fs/ext4/namei.c:2774
    vfs_mkdir+0x386/0x5f0 fs/namei.c:3811
    do_mkdirat+0x11c/0x210 fs/namei.c:3834
    do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:294
    ...
    -------------------------------------

    scripts:
    mkfs.ext4 /dev/vdb
    mount /dev/vdb /mnt
    cd /mnt && mkdir dir && for i in {1..8}; do setfacl -dm "u:user_"$i":rx" dir; done
    mkdir dir/dir1 && mv dir/dir1 ./
    sh repro.sh && add some user

    [root@localhost ~]# cat repro.sh
    while [ 1 -eq 1 ]; do
    rm -rf dir
    rm -rf dir1/dir1
    mkdir dir
    for i in {1..8}; do setfacl -dm "u:test"$i":rx" dir; done
    setfacl -m "u:user_9:rx" dir &
    mkdir dir1/dir1 &
    done

    Before repro.sh is executed, dir1 has inherited the default acl from
    dir, and the xattr blocks of dir1 and dir are not the same, so the
    h_refcount of each directory's xattr block is 1. repro.sh can then
    trigger the warning in the situation shown below. The last h_refcount
    can be cleared by mkdir, and __ext4_new_inode has not reserved revoke
    credits, so the warning fires. Fix it by reserving revoke credits in
    __ext4_new_inode.

    Thread 1                            Thread 2
    mkdir dir
    set default acl(will create
    a xattr block blk1 and the
    refcount of ext4_xattr_header
    will be 1)
                                        ...
                                        mkdir dir1/dir1
                                        ->....->ext4_init_acl
                                        ->__ext4_set_acl(set default acl,
                                          will reuse blk1, and h_refcount
                                          will be 2)

    setfacl->ext4_set_acl->...
    ->ext4_xattr_block_set(will create
      new block blk2 to store xattr)

                                        ->__ext4_set_acl(set access acl, since
                                          h_refcount of blk1 is 2, will create
                                          blk3 to store xattr)

    ->ext4_xattr_release_block(dec
      h_refcount of blk1 to 1)
                                        ->ext4_xattr_release_block(dec
                                          h_refcount and since it is 0,
                                          will release the block and trigger
                                          the warning)

    Link: https://lore.kernel.org/r/20191213014900.47228-1-yangerkun@huawei.com
    Reported-by: Hulk Robot
    Reviewed-by: Jan Kara
    Signed-off-by: yangerkun
    Signed-off-by: Theodore Ts'o

    yangerkun
     

15 Nov, 2019

1 commit

  • Commit 8fcc3a580651 ("ext4: rework reserved cluster accounting when
    invalidating pages") moved freeing of delayed allocation reservations
    from dirty page invalidation time to the time when we evict the
    corresponding status extent from the extent status tree. For inodes which
    don't have any blocks allocated this may actually happen only in
    ext4_clear_blocks(), which is after we've dropped references to quota
    structures from the inode. Thus the quota reservation leaked. Fix the
    problem by clearing quota information from the inode only after evicting
    the extent status tree in ext4_clear_inode().

    Link: https://lore.kernel.org/r/20191108115420.GI20863@quack2.suse.cz
    Reported-by: Konstantin Khlebnikov
    Fixes: 8fcc3a580651 ("ext4: rework reserved cluster accounting when invalidating pages")
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

06 Nov, 2019

1 commit

  • So far we have reserved only a relatively high fixed amount of revoke
    credits for each transaction. We over-reserved by a large amount in most
    cases, but when freeing large directories or files with data journalling,
    the fixed amount is not enough. In fact the worst case estimate is
    inconveniently large (maximum extent size) for freeing of one extent.

    We fix this by properly estimating the number of blocks that need
    to be revoked when removing blocks from the inode due to truncate or
    hole punching, and otherwise reserving just a small amount of revoke
    credits for each transaction to accommodate freeing of an xattr block or
    so.

    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20191105164437.32602-23-jack@suse.cz
    Signed-off-by: Theodore Ts'o

    Jan Kara
     

26 Apr, 2019

1 commit

  • This patch implements the actual support for case-insensitive file name
    lookups in ext4, based on the feature bit and the encoding stored in the
    superblock.

    A filesystem that has the casefold feature set is able to configure
    directories with the +F (EXT4_CASEFOLD_FL) attribute, enabling lookups
    to succeed in that directory in a case-insensitive fashion, i.e: match
    a directory entry even if the name used by userspace is not a byte per
    byte match with the disk name, but is an equivalent case-insensitive
    version of the Unicode string. This operation is called a
    case-insensitive file name lookup.

    The feature is configured as an inode attribute applied to directories
    and inherited by its children. This attribute can only be enabled on
    empty directories for filesystems that support the encoding feature,
    thus preventing collision of file names that only differ by case.

    * dcache handling:

    For a +F directory, Ext4 only stores the first equivalent name dentry
    used in the dcache. This is done to prevent unintentional duplication of
    dentries in the dcache, while also allowing the VFS code to quickly find
    the right entry in the cache despite which equivalent string was used in
    a previous lookup, without having to resort to ->lookup().

    d_hash() of casefolded directories is implemented as the hash of the
    casefolded string, such that we always have a well-known bucket for all
    the equivalencies of the same string. d_compare() uses the
    utf8_strncasecmp() infrastructure, which handles the comparison of
    equivalent, same case, names as well.
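
The hash-and-compare pairing can be sketched in userspace as follows. Two assumptions are swapped in to keep it short: folding is ASCII-only via `tolower()` instead of the Unicode casefolding in fs/unicode, and the hash is FNV-1a rather than whatever the dcache actually uses. All `toy_*` names are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

/* ASCII-only folding: an assumption made for this sketch. */
static unsigned char toy_fold(unsigned char c)
{
    return (unsigned char)tolower(c);
}

/* FNV-1a over the folded bytes, so names that differ only by case
 * always land in the same hash bucket. */
static uint32_t toy_ci_hash(const char *name, size_t len)
{
    uint32_t h = 2166136261u;

    for (size_t i = 0; i < len; i++) {
        h ^= toy_fold((unsigned char)name[i]);
        h *= 16777619u;
    }
    return h;
}

/* Returns 0 when the two names are equal after folding, 1 otherwise. */
static int toy_ci_compare(const char *a, size_t alen,
                          const char *b, size_t blen)
{
    if (alen != blen)
        return 1;
    for (size_t i = 0; i < alen; i++)
        if (toy_fold((unsigned char)a[i]) != toy_fold((unsigned char)b[i]))
            return 1;
    return 0;
}
```

Hashing the folded form is what guarantees a single well-known bucket per set of equivalent names; the compare function then decides equality within that bucket.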

    For now, negative lookups are not inserted in the dcache, since they
    would need to be invalidated anyway, because we can't trust missing file
    dentries. This is bad for performance but requires some leveraging of
    the vfs layer to fix. We can live without that for now, and so does
    everyone else.

    * on-disk data:

    Despite using a specific version of the name as the internal
    representation within the dcache, the name stored and fetched from the
    disk is a byte-per-byte match with what the user requested, making this
    implementation 'name-preserving'. i.e. no actual information is lost
    when writing to storage.

    DX is supported by modifying the hashes used in +F directories to make
    them case/encoding-aware. The new disk hashes are calculated as the
    hash of the full casefolded string, instead of the string directly.
    This allows us to efficiently search for file names in the htree without
    requiring the user to provide an exact name.

    * Dealing with invalid sequences:

    By default, when an invalid UTF-8 sequence is identified, ext4 will treat
    it as an opaque byte sequence, ignoring the encoding and reverting to
    the old behavior for that unique file. This means that case-insensitive
    file name lookup will not work only for that file. An optional bit can
    be set in the superblock telling the filesystem code and userspace tools
    to enforce the encoding. When that optional bit is set, any attempt to
    create a file name using an invalid UTF-8 sequence will fail and return
    an error to userspace.

    * Normalization algorithm:

    The UTF-8 algorithm used to compare strings in ext4 lives in
    fs/unicode, and is based on a previous version developed by
    SGI. It implements the Canonical decomposition (NFD) algorithm
    described by the Unicode specification 12.1, or higher, combined with
    the elimination of ignorable code points (NFDi) and full
    case-folding (CF) as documented in fs/unicode/utf8_norm.c.

    NFD seems to be the best normalization method for EXT4 because:

    - It has a lower cost than NFC/NFKC (which requires
    decomposing to NFD as an intermediary step)
    - It doesn't eliminate important semantic meaning like
    compatibility decompositions.

    Although:

    - This implementation is not completely linguistically accurate, because
    different languages have conflicting rules, which would require the
    specialization of the filesystem to a given locale, which brings all
    sorts of problems for removable media and for users who use more than
    one language.

    Signed-off-by: Gabriel Krisman Bertazi
    Signed-off-by: Theodore Ts'o

    Gabriel Krisman Bertazi
     

24 Jan, 2019

1 commit


20 Dec, 2018

1 commit

  • If we receive a file handle, either from NFS or open_by_handle_at(2),
    and it points at an inode which has not been initialized, and the file
    system has metadata checksums enabled, we shouldn't try to get the
    inode, discover the checksum is invalid, and then declare the file
    system as being inconsistent.

    This can be reproduced by creating a test file system via "mke2fs -t
    ext4 -O metadata_csum /tmp/foo.img 8M", mounting it, cd'ing into that
    directory, and then running the following program.

    #define _GNU_SOURCE
    #include <fcntl.h>

    struct handle {
            struct file_handle fh;
            unsigned char fid[MAX_HANDLE_SZ];
    };

    int main(int argc, char **argv)
    {
            struct handle h = {{8, 1 }, { 12, }};

            open_by_handle_at(AT_FDCWD, &h.fh, O_RDONLY);
            return 0;
    }

    Google-Bug-Id: 120690101
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    Theodore Ts'o
     

11 Oct, 2018

1 commit


02 Aug, 2018

1 commit


30 Jul, 2018

2 commits

  • This is the last missing piece for the inode times on 32-bit systems:
    now that VFS interfaces use timespec64, we just need to stop truncating
    the tv_sec values for y2038 compatibility.
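
What truncating tv_sec costs can be shown with a short userspace sketch. The helper names are hypothetical; the only fixed fact used is that a signed 32-bit seconds counter tops out at 2147483647, which corresponds to 03:14:07 UTC on 19 January 2038.

```c
#include <assert.h>
#include <stdint.h>

/* Keep only the low 32 bits of a 64-bit seconds value and reinterpret
 * them as a signed 32-bit number (two's complement), which is what
 * storing the value in a 32-bit field amounts to. */
static int32_t toy_truncate_sec(int64_t sec)
{
    uint32_t low = (uint32_t)sec;  /* well-defined: reduces modulo 2^32 */

    if (low <= (uint32_t)INT32_MAX)
        return (int32_t)low;
    return (int32_t)(low - 0x80000000u) + INT32_MIN;
}

/* True when the value survives the round trip through 32 bits. */
static int toy_fits_32bit(int64_t sec)
{
    return (int64_t)toy_truncate_sec(sec) == sec;
}
```

One second past the limit no longer fits and the truncated value wraps to a large negative number, so the timestamp would be read back as a date in 1901.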

    Reviewed-by: Andreas Dilger
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Theodore Ts'o

    Arnd Bergmann
     
  • Commit 8844618d8aa7: "ext4: only look at the bg_flags field if it is
    valid" will complain if block group zero does not have the
    EXT4_BG_INODE_ZEROED flag set. Unfortunately, this is not correct,
    since a freshly created file system has this flag cleared. It gets set
    almost immediately after the file system is mounted read-write --- but
    the following somewhat unlikely sequence will end up triggering a
    false positive report of a corrupted file system:

    mkfs.ext4 /dev/vdc
    mount -o ro /dev/vdc /vdc
    mount -o remount,rw /dev/vdc

    Instead, when initializing the inode table for block group zero, test
    to make sure that itable_unused count is not too large, since that is
    the case that will result in some or all of the reserved inodes
    getting cleared.

    This fixes the failures reported by Eric Whitney when running
    generic/230 and generic/231 in the nojournal test case.

    Fixes: 8844618d8aa7 ("ext4: only look at the bg_flags field if it is valid")
    Reported-by: Eric Whitney
    Signed-off-by: Theodore Ts'o

    Theodore Ts'o
     

13 Jul, 2018

1 commit

  • With commit 044e6e3d74a3: "ext4: don't update checksum of new
    initialized bitmaps" the buffer valid bit will get set without
    actually setting up the checksum for the allocation bitmap, since the
    checksum will get calculated once we actually allocate an inode or
    block.

    If we are doing this, then we need to (re-)check the verified bit
    after we take the block group lock. Otherwise, we could race with
    another process reading and verifying the bitmap, which would then
    complain about the checksum being invalid.

    https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1780137

    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    Theodore Ts'o
     

09 Jul, 2018

1 commit

  • Pull ext4 bugfixes from Ted Ts'o:
    "Bug fixes for ext4; most of which relate to vulnerabilities where a
    maliciously crafted file system image can result in a kernel OOPS or
    hang.

    At least one fix addresses an inline data bug that could be triggered by
    userspace without the need of a crafted file system (although it does
    require that the inline data feature be enabled)"

    * tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
    ext4: check superblock mapped prior to committing
    ext4: add more mount time checks of the superblock
    ext4: add more inode number paranoia checks
    ext4: avoid running out of journal credits when appending to an inline file
    jbd2: don't mark block as modified if the handle is out of credits
    ext4: never move the system.data xattr out of the inode body
    ext4: clear i_data in ext4_inode_info when removing inline data
    ext4: include the illegal physical block in the bad map ext4_error msg
    ext4: verify the depth of extent tree in ext4_find_extent()
    ext4: only look at the bg_flags field if it is valid
    ext4: make sure bitmaps and the inode table don't overlap with bg descriptors
    ext4: always check block group bounds in ext4_init_block_bitmap()
    ext4: always verify the magic number in xattr blocks
    ext4: add corruption check in ext4_xattr_set_entry()
    ext4: add warn_on_error mount option

    Linus Torvalds
     

15 Jun, 2018

1 commit

  • Pull inode timestamps conversion to timespec64 from Arnd Bergmann:
    "This is a late set of changes from Deepa Dinamani doing an automated
    treewide conversion of the inode and iattr structures from 'timespec'
    to 'timespec64', to push the conversion from the VFS layer into the
    individual file systems.

    As Deepa writes:

    'The series aims to switch vfs timestamps to use struct timespec64.
    Currently vfs uses struct timespec, which is not y2038 safe.

    The series involves the following:
    1. Add vfs helper functions for supporting struct timespec64
    timestamps.
    2. Cast prints of vfs timestamps to avoid warnings after the switch.
    3. Simplify code using vfs timestamps so that the actual replacement
    becomes easy.
    4. Convert vfs timestamps to use struct timespec64 using a script.
    This is a flag day patch.

    Next steps:
    1. Convert APIs that can handle timespec64, instead of converting
    timestamps at the boundaries.
    2. Update internal data structures to avoid timestamp conversions'

    Thomas Gleixner adds:

    'I think there is no point to drag that out for the next merge
    window. The whole thing needs to be done in one go for the core
    changes which means that you're going to play that catchup game
    forever. Let's get over with it towards the end of the merge window'"

    * tag 'vfs-timespec64' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground:
    pstore: Remove bogus format string definition
    vfs: change inode times to use struct timespec64
    pstore: Convert internal records to timespec64
    udf: Simplify calls to udf_disk_stamp_to_time
    fs: nfs: get rid of memcpys for inode times
    ceph: make inode time prints to be long long
    lustre: Use long long type to print inode time
    fs: add timespec64_truncate()

    Linus Torvalds
     

14 Jun, 2018

1 commit

  • The bg_flags field in the block group descriptors is only valid if the
    uninit_bg or metadata_csum feature is enabled. We were not
    consistently looking at this field; fix this.

    Also block group #0 must never have uninitialized allocation bitmaps,
    or need to be zeroed, since that's where the root inode, and other
    special inodes are set up. Check for these conditions and mark the
    file system as corrupted if they are detected.
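
    The two checks described above could look roughly like the sketch
    below. It is a userspace model under stated assumptions: the demo_*
    structures and functions are hypothetical and much simpler than the
    real ext4 code, and only the "uninitialized bitmap" condition for
    block group #0 is modelled. The two flag values follow the ext4
    on-disk format.

```c
#include <stdbool.h>
#include <stdint.h>

/* Flag values as defined by the ext4 on-disk format. */
#define DEMO_BG_INODE_UNINIT 0x0001 /* inode bitmap not initialized */
#define DEMO_BG_BLOCK_UNINIT 0x0002 /* block bitmap not initialized */

/* Hypothetical, heavily simplified stand-ins for the ext4 structures. */
struct demo_sb {
    bool has_uninit_bg;     /* uninit_bg feature enabled */
    bool has_metadata_csum; /* metadata_csum feature enabled */
};

struct demo_group_desc {
    uint16_t bg_flags;
};

/* bg_flags is only meaningful when one of the two features is enabled. */
static uint16_t demo_effective_flags(const struct demo_sb *sb,
                                     const struct demo_group_desc *gd)
{
    if (!sb->has_uninit_bg && !sb->has_metadata_csum)
        return 0;
    return gd->bg_flags;
}

/* Block group 0 must never carry an uninitialized allocation bitmap. */
static bool demo_group_desc_corrupted(const struct demo_sb *sb,
                                      unsigned int group,
                                      const struct demo_group_desc *gd)
{
    uint16_t flags = demo_effective_flags(sb, gd);

    return group == 0 &&
           (flags & (DEMO_BG_INODE_UNINIT | DEMO_BG_BLOCK_UNINIT)) != 0;
}
```

    Routing every read of bg_flags through one helper is what makes the
    check consistent: callers cannot forget the feature test.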

    This addresses CVE-2018-10876.

    https://bugzilla.kernel.org/show_bug.cgi?id=199403

    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org

    Theodore Ts'o
     

06 Jun, 2018

1 commit

  • struct timespec is not y2038 safe. Transition vfs to use
    y2038 safe struct timespec64 instead.

    The change was made with the help of the following coccinelle
    script. This catches about 80% of the changes.
    All the header file and logic changes are included in the
    first 5 rules. The rest are trivial substitutions.
    I avoid changing any of the function signatures or any other
    filesystem specific data structures to keep the patch simple
    for review.

    The script can be a little shorter by combining different cases.
    But, this version was sufficient for my use case.

    virtual patch

    @ depends on patch @
    identifier now;
    @@
    - struct timespec
    + struct timespec64
    current_time ( ... )
    {
    - struct timespec now = current_kernel_time();
    + struct timespec64 now = current_kernel_time64();
    ...
    - return timespec_trunc(
    + return timespec64_trunc(
    ... );
    }

    @ depends on patch @
    identifier xtime;
    @@
    struct \( iattr \| inode \| kstat \) {
    ...
    - struct timespec xtime;
    + struct timespec64 xtime;
    ...
    }

    @ depends on patch @
    identifier t;
    @@
    struct inode_operations {
    ...
    int (*update_time) (...,
    - struct timespec t,
    + struct timespec64 t,
    ...);
    ...
    }

    @ depends on patch @
    identifier t;
    identifier fn_update_time =~ "update_time$";
    @@
    fn_update_time (...,
    - struct timespec *t,
    + struct timespec64 *t,
    ...) { ... }

    @ depends on patch @
    identifier t;
    @@
    lease_get_mtime( ... ,
    - struct timespec *t
    + struct timespec64 *t
    ) { ... }

    @te depends on patch forall@
    identifier ts;
    local idexpression struct inode *inode_node;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier fn_update_time =~ "update_time$";
    identifier fn;
    expression e, E3;
    local idexpression struct inode *node1;
    local idexpression struct inode *node2;
    local idexpression struct iattr *attr1;
    local idexpression struct iattr *attr2;
    local idexpression struct iattr attr;
    identifier i_xtime1 =~ "^i_[acm]time$";
    identifier i_xtime2 =~ "^i_[acm]time$";
    identifier ia_xtime1 =~ "^ia_[acm]time$";
    identifier ia_xtime2 =~ "^ia_[acm]time$";
    @@
    (
    (
    - struct timespec ts;
    + struct timespec64 ts;
    |
    - struct timespec ts = current_time(inode_node);
    + struct timespec64 ts = current_time(inode_node);
    )

    <+...
    (
    - timespec_equal(&inode_node->i_xtime, &ts)
    + timespec64_equal(&inode_node->i_xtime, &ts)
    |
    - timespec_equal(&ts, &inode_node->i_xtime)
    + timespec64_equal(&ts, &inode_node->i_xtime)
    |
    - timespec_compare(&inode_node->i_xtime, &ts)
    + timespec64_compare(&inode_node->i_xtime, &ts)
    |
    - timespec_compare(&ts, &inode_node->i_xtime)
    + timespec64_compare(&ts, &inode_node->i_xtime)
    |
    ts = current_time(e)
    |
    fn_update_time(..., &ts,...)
    |
    inode_node->i_xtime = ts
    |
    node1->i_xtime = ts
    |
    ts = inode_node->i_xtime
    |
    <+... attr1->ia_xtime ...+> = ts
    |
    ts = attr1->ia_xtime
    |
    ts.tv_sec
    |
    ts.tv_nsec
    |
    btrfs_set_stack_timespec_sec(..., ts.tv_sec)
    |
    btrfs_set_stack_timespec_nsec(..., ts.tv_nsec)
    |
    - ts = timespec64_to_timespec(
    + ts =
    ...
    -)
    |
    - ts = ktime_to_timespec(
    + ts = ktime_to_timespec64(
    ...)
    |
    - ts = E3
    + ts = timespec_to_timespec64(E3)
    |
    - ktime_get_real_ts(&ts)
    + ktime_get_real_ts64(&ts)
    |
    fn(...,
    - ts
    + timespec64_to_timespec(ts)
    ,...)
    )
    ...+>
    (

    )
    |
    - timespec_equal(&node1->i_xtime1, &node2->i_xtime2)
    + timespec64_equal(&node1->i_xtime1, &node2->i_xtime2)
    |
    - timespec_equal(&node1->i_xtime1, &attr2->ia_xtime2)
    + timespec64_equal(&node1->i_xtime1, &attr2->ia_xtime2)
    |
    - timespec_compare(&node1->i_xtime1, &node2->i_xtime2)
    + timespec64_compare(&node1->i_xtime1, &node2->i_xtime2)
    |
    node1->i_xtime1 =
    - timespec_trunc(attr1->ia_xtime1,
    + timespec64_trunc(attr1->ia_xtime1,
    ...)
    |
    - attr1->ia_xtime1 = timespec_trunc(attr2->ia_xtime2,
    + attr1->ia_xtime1 = timespec64_trunc(attr2->ia_xtime2,
    ...)
    |
    - ktime_get_real_ts(&attr1->ia_xtime1)
    + ktime_get_real_ts64(&attr1->ia_xtime1)
    |
    - ktime_get_real_ts(&attr.ia_xtime1)
    + ktime_get_real_ts64(&attr.ia_xtime1)
    )

    @ depends on patch @
    struct inode *node;
    struct iattr *attr;
    identifier fn;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    expression e;
    @@
    (
    - fn(node->i_xtime);
    + fn(timespec64_to_timespec(node->i_xtime));
    |
    fn(...,
    - node->i_xtime);
    + timespec64_to_timespec(node->i_xtime));
    |
    - e = fn(attr->ia_xtime);
    + e = fn(timespec64_to_timespec(attr->ia_xtime));
    )

    @ depends on patch forall @
    struct inode *node;
    struct iattr *attr;
    identifier i_xtime =~ "^i_[acm]time$";
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier fn;
    @@
    {
    + struct timespec ts;
    <+...
    (
    + ts = timespec64_to_timespec(node->i_xtime);
    fn (...,
    - &node->i_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    fn (...,
    - &attr->ia_xtime,
    + &ts,
    ...);
    )
    ...+>
    }

    @ depends on patch forall @
    struct inode *node;
    struct iattr *attr;
    struct kstat *stat;
    identifier ia_xtime =~ "^ia_[acm]time$";
    identifier i_xtime =~ "^i_[acm]time$";
    identifier xtime =~ "^[acm]time$";
    identifier fn, ret;
    @@
    {
    + struct timespec ts;
    <+...
    (
    + ts = timespec64_to_timespec(node->i_xtime);
    ret = fn (...,
    - &node->i_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(node->i_xtime);
    ret = fn (...,
    - &node->i_xtime);
    + &ts);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    ret = fn (...,
    - &attr->ia_xtime,
    + &ts,
    ...);
    |
    + ts = timespec64_to_timespec(attr->ia_xtime);
    ret = fn (...,
    - &attr->ia_xtime);
    + &ts);
    |
    + ts = timespec64_to_timespec(stat->xtime);
    ret = fn (...,
    - &stat->xtime);
    + &ts);
    )
    ...+>
    }

    @ depends on patch @
    struct inode *node;
    struct inode *node2;
    identifier i_xtime1 =~ "^i_[acm]time$";
    identifier i_xtime2 =~ "^i_[acm]time$";
    identifier i_xtime3 =~ "^i_[acm]time$";
    struct iattr *attrp;
    struct iattr *attrp2;
    struct iattr attr ;
    identifier ia_xtime1 =~ "^ia_[acm]time$";
    identifier ia_xtime2 =~ "^ia_[acm]time$";
    struct kstat *stat;
    struct kstat stat1;
    struct timespec64 ts;
    identifier xtime =~ "^[acmb]time$";
    expression e;
    @@
    (
    \( node->i_xtime2 \| attrp->ia_xtime2 \| attr.ia_xtime2 \) = node->i_xtime1 ;
    |
    node->i_xtime2 = \( node2->i_xtime1 \| timespec64_trunc(...) \);
    |
    node->i_xtime2 = node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
    |
    node->i_xtime1 = node->i_xtime3 = \(ts \| current_time(...) \);
    |
    stat->xtime = node2->i_xtime1;
    |
    stat1.xtime = node2->i_xtime1;
    |
    \( node->i_xtime2 \| attrp->ia_xtime2 \) = attrp->ia_xtime1 ;
    |
    \( attrp->ia_xtime1 \| attr.ia_xtime1 \) = attrp2->ia_xtime2;
    |
    - e = node->i_xtime1;
    + e = timespec64_to_timespec( node->i_xtime1 );
    |
    - e = attrp->ia_xtime1;
    + e = timespec64_to_timespec( attrp->ia_xtime1 );
    |
    node->i_xtime1 = current_time(...);
    |
    node->i_xtime2 = node->i_xtime1 = node->i_xtime3 =
    - e;
    + timespec_to_timespec64(e);
    |
    node->i_xtime1 = node->i_xtime3 =
    - e;
    + timespec_to_timespec64(e);
    |
    - node->i_xtime1 = e;
    + node->i_xtime1 = timespec_to_timespec64(e);
    )
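
    Several of the rules above add a local struct timespec and convert at
    the call boundary, so that callees which still take the old type keep
    working. A minimal userspace sketch of that pattern follows. All
    demo_* names are hypothetical; the two converters only mimic the role
    of timespec64_to_timespec() and timespec_to_timespec64() and are not
    the kernel implementations.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the old and new timestamp types. */
struct demo_timespec   { long    tv_sec; long tv_nsec; };
struct demo_timespec64 { int64_t tv_sec; long tv_nsec; };

/* Narrowing conversion; loses range wherever long is 32 bits wide. */
static struct demo_timespec demo_ts64_to_ts(struct demo_timespec64 ts64)
{
    struct demo_timespec ts = { (long)ts64.tv_sec, ts64.tv_nsec };
    return ts;
}

/* Widening conversion; always lossless. */
static struct demo_timespec64 demo_ts_to_ts64(struct demo_timespec ts)
{
    struct demo_timespec64 ts64 = { (int64_t)ts.tv_sec, ts.tv_nsec };
    return ts64;
}

/* A callee that has not been converted and still takes the old type. */
static long demo_legacy_seconds(const struct demo_timespec *ts)
{
    return ts->tv_sec;
}

/* Toy inode whose timestamp field has already been switched over. */
struct demo_inode { struct demo_timespec64 i_mtime; };

/* After the rewrite: convert into a local, then pass its address. */
static long demo_read_mtime_seconds(const struct demo_inode *inode)
{
    struct demo_timespec ts = demo_ts64_to_ts(inode->i_mtime);

    return demo_legacy_seconds(&ts);
}
```

    The temporary keeps function signatures untouched, which matches the
    stated goal of keeping the patch simple to review; the conversions
    can be dropped once the callees accept the 64-bit type.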

    Signed-off-by: Deepa Dinamani

    Deepa Dinamani
     

13 May, 2018

1 commit

  • There are still some cases where we fail to set the
    block bitmap corrupted bit properly:

    1) the inode bitmap number is wrong.
    2) reading the block bitmap fails due to disk errors.
    3) double allocations from the bitmap.

    Also remove a duplicate ext4_error() call after
    ext4_read_inode_bitmap(), as ext4_error() has already been
    called inside ext4_read_inode_bitmap().
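
    Case 3 above can be sketched in userspace as follows. This is only an
    illustration under stated assumptions: struct demo_group and
    demo_alloc_bit() are hypothetical, the bitmap is a fixed 64 bits, and
    the real ext4 code tracks the corrupted state differently.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_BITS 64u

/* Hypothetical per-group state: a tiny bitmap plus a corruption flag. */
struct demo_group {
    uint8_t bitmap[DEMO_BITS / 8];
    bool    bitmap_corrupt;
};

/*
 * Try to allocate `bit`. A bit that is already set means the bitmap and
 * the caller disagree (a double allocation), so the group is flagged as
 * corrupted instead of being used further.
 */
static bool demo_alloc_bit(struct demo_group *g, unsigned int bit)
{
    uint8_t mask;

    if (bit >= DEMO_BITS) {
        g->bitmap_corrupt = true; /* out-of-range request */
        return false;
    }

    mask = (uint8_t)(1u << (bit % 8));
    if (g->bitmap[bit / 8] & mask) {
        g->bitmap_corrupt = true; /* double allocation detected */
        return false;
    }

    g->bitmap[bit / 8] |= mask;
    return true;
}
```

    The first allocation of a bit succeeds; a second attempt on the same
    bit fails and leaves bitmap_corrupt set so later callers can avoid
    the group.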

    Signed-off-by: Wang Shilong
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Andreas Dilger

    Wang Shilong
     

12 May, 2018

1 commit