01 Oct, 2020

2 commits

  • [ Upstream commit eb5760863fc28feab28b567ddcda7e667e638da0 ]

    We already have similar code in ext4_mb_complex_scan_group(), but
    ext4_mb_simple_scan_group() is still affected.

    Other reports: https://www.spinics.net/lists/linux-ext4/msg60231.html

    Reviewed-by: Andreas Dilger
    Signed-off-by: Dmitry Monakhov
    Link: https://lore.kernel.org/r/20200310150156.641-1-dmonakhov@gmail.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Dmitry Monakhov
     
  • [ Upstream commit dce8e237100f60c28cc66effb526ba65a01d8cb3 ]

    KCSAN found that inode->i_disksize could be accessed concurrently.

    BUG: KCSAN: data-race in ext4_mark_iloc_dirty / ext4_write_end

    write (marked) to 0xffff8b8932f40090 of 8 bytes by task 66792 on cpu 0:
    ext4_write_end+0x53f/0x5b0
    ext4_da_write_end+0x237/0x510
    generic_perform_write+0x1c4/0x2a0
    ext4_buffered_write_iter+0x13a/0x210
    ext4_file_write_iter+0xe2/0x9b0
    new_sync_write+0x29c/0x3a0
    __vfs_write+0x92/0xa0
    vfs_write+0xfc/0x2a0
    ksys_write+0xe8/0x140
    __x64_sys_write+0x4c/0x60
    do_syscall_64+0x8a/0x2a0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    read to 0xffff8b8932f40090 of 8 bytes by task 14414 on cpu 1:
    ext4_mark_iloc_dirty+0x716/0x1190
    ext4_mark_inode_dirty+0xc9/0x360
    ext4_convert_unwritten_extents+0x1bc/0x2a0
    ext4_convert_unwritten_io_end_vec+0xc5/0x150
    ext4_put_io_end+0x82/0x130
    ext4_writepages+0xae7/0x16f0
    do_writepages+0x64/0x120
    __writeback_single_inode+0x7d/0x650
    writeback_sb_inodes+0x3a4/0x860
    __writeback_inodes_wb+0xc4/0x150
    wb_writeback+0x43f/0x510
    wb_workfn+0x3b2/0x8a0
    process_one_work+0x39b/0x7e0
    worker_thread+0x88/0x650
    kthread+0x1d4/0x1f0
    ret_from_fork+0x35/0x40

    The plain read is outside of inode->i_data_sem critical section
    which results in a data race. Fix it by adding READ_ONCE().
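
    As a rough userspace illustration of the idea (not the kernel code
    itself), the sketch below models READ_ONCE() with a volatile access so
    that the compiler must load the field exactly once. TOY_READ_ONCE,
    struct toy_inode and toy_disksize_snapshot() are names assumed for this
    example, and __typeof__ is a GCC/Clang extension.

    ```c
    #include <stdint.h>

    /* Simplified stand-in for the kernel macro: the volatile-qualified
     * access makes the compiler emit one load and forbids re-reading. */
    #define TOY_READ_ONCE(x) (*(const volatile __typeof__(x) *)&(x))

    /* Hypothetical inode holding only the field of interest. */
    struct toy_inode {
        int64_t i_disksize;
    };

    /* Take one snapshot of i_disksize and use only that value afterwards,
     * even if another thread updates the field in the meantime. */
    int64_t toy_disksize_snapshot(struct toy_inode *inode)
    {
        int64_t size = TOY_READ_ONCE(inode->i_disksize);
        return size;
    }
    ```

    The point of the snapshot is that every later use sees the same value,
    rather than whatever the field happens to contain at each use.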

    Signed-off-by: Qiujun Huang
    Link: https://lore.kernel.org/r/1582556566-3909-1-git-send-email-hqjagain@gmail.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Qiujun Huang
     

03 Sep, 2020

6 commits

  • [ Upstream commit 377254b2cd2252c7c3151b113cbdf93a7736c2e9 ]

    If a device is hot-removed --- for example, when a physical device is
    unplugged from a PCIe slot or an nbd device's network is shut down ---
    this can result in a BUG_ON() crash in submit_bh_wbc(). This is
    because when the block device dies, the buffer heads will have
    their Buffer_Mapped flag cleared, leading to the crash in
    submit_bh_wbc().

    We had attempted to work around this problem in commit a17712c8
    ("ext4: check superblock mapped prior to committing"). Unfortunately,
    it's still possible to hit the BUG_ON(!buffer_mapped(bh)) if the
    device dies between the work-around check in ext4_commit_super()
    and when submit_bh_wbc() is finally called:

    Code path:
    ext4_commit_super
    judge if 'buffer_mapped(sbh)' is false, return
    Signed-off-by: Sasha Levin

    Xianting Tian
     
  • [ Upstream commit 0f5bde1db174f6c471f0bd27198575719dabe3e5 ]

    When remounting filesystem fails late during remount handling and
    block_validity mount option is also changed during the remount, we fail
    to restore system zone information to a state matching the mount option.
    This is mostly harmless, just the block validity checking will not match
    the situation described by the mount option. Make sure these two are always
    consistent.

    Reported-by: Lukas Czerner
    Reviewed-by: Lukas Czerner
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20200728130437.7804-7-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Jan Kara
     
  • [ Upstream commit d176b1f62f242ab259ff665a26fbac69db1aecba ]

    ext4_setup_system_zone() can fail. Handle the failure in ext4_remount().

    Reviewed-by: Lukas Czerner
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20200728130437.7804-2-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Jan Kara
     
  • [ Upstream commit f25391ebb475d3ffb3aa61bb90e3594c841749ef ]

    Currently there is a problem with mount options that can be both set by
    vfs using mount flags or by a string parsing in ext4.

    i_version/iversion options get lost after remount, for example

    $ mount -o i_version /dev/pmem0 /mnt
    $ grep pmem0 /proc/self/mountinfo | grep i_version
    310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,seclabel,i_version
    $ mount -o remount,ro /mnt
    $ grep pmem0 /proc/self/mountinfo | grep i_version

    nolazytime gets ignored by ext4 on remount, for example

    $ mount -o lazytime /dev/pmem0 /mnt
    $ grep pmem0 /proc/self/mountinfo | grep lazytime
    310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel
    $ mount -o remount,nolazytime /mnt
    $ grep pmem0 /proc/self/mountinfo | grep lazytime
    310 95 259:0 / /mnt rw,relatime shared:163 - ext4 /dev/pmem0 rw,lazytime,seclabel

    Fix it by applying the SB_LAZYTIME and SB_I_VERSION flags from *flags to
    s_flags before we parse the option and use the resulting state of the
    same flags in *flags at the end of successful remount.
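
    The three steps in that fix can be sketched in userspace as follows.
    This is a simplified model under stated assumptions, not the ext4
    code: the TOY_* bit values, toy_remount() and its parsed_bits
    parameter (standing in for the result of option-string parsing) are
    invented for illustration.

    ```c
    /* Illustrative bit values; the kernel defines the real SB_* constants. */
    #define TOY_SB_I_VERSION (1UL << 0)
    #define TOY_SB_LAZYTIME  (1UL << 1)
    #define TOY_VFS_FLAGS    (TOY_SB_I_VERSION | TOY_SB_LAZYTIME)

    /*
     * s_flags:     flags currently stored in the superblock
     * flags:       flags requested by the VFS for this remount (in/out)
     * parsed_bits: flags enabled by the option string, e.g. "i_version"
     * Returns the new superblock flags.
     */
    unsigned long toy_remount(unsigned long s_flags, unsigned long *flags,
                              unsigned long parsed_bits)
    {
        /* 1. Take the VFS view of the two flags before parsing options. */
        s_flags = (s_flags & ~TOY_VFS_FLAGS) | (*flags & TOY_VFS_FLAGS);

        /* 2. String options parsed by the filesystem may enable flags. */
        s_flags |= parsed_bits & TOY_VFS_FLAGS;

        /* 3. On success, report the resulting state back to the VFS. */
        *flags = (*flags & ~TOY_VFS_FLAGS) | (s_flags & TOY_VFS_FLAGS);
        return s_flags;
    }
    ```

    In this model a remount with nolazytime (flag absent from *flags)
    clears the lazytime bit, and an i_version string option survives
    because step 3 copies it back into *flags.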

    Signed-off-by: Lukas Czerner
    Reviewed-by: Ritesh Harjani
    Link: https://lore.kernel.org/r/20200723150526.19931-1-lczerner@redhat.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Lukas Czerner
     
  • [ Upstream commit 273108fa5015eeffc4bacfa5ce272af3434b96e4 ]

    Ext4 uses blkdev_get_by_dev() to get the block_device for journal device
    which does check to see if the read-only block device was opened
    read-only.

    As a result ext4 will happily proceed mounting the file system with
    external journal on read-only device. This is bad as we would not be
    able to use the journal leading to errors later on.

    Instead of simply failing to mount file system in this case, treat it in
    a similar way we treat internal journal on read-only device. Allow to
    mount with -o noload in read-only mode.

    This can be reproduced easily like this:

    mke2fs -F -O journal_dev $JOURNAL_DEV 100M
    mkfs.$FSTYPE -F -J device=$JOURNAL_DEV $FS_DEV
    blockdev --setro $JOURNAL_DEV
    mount $FS_DEV $MNT
    touch $MNT/file
    umount $MNT

    leading to an error like this:

    [ 1307.318713] ------------[ cut here ]------------
    [ 1307.323362] generic_make_request: Trying to write to read-only block-device dm-2 (partno 0)
    [ 1307.331741] WARNING: CPU: 36 PID: 3224 at block/blk-core.c:855 generic_make_request_checks+0x2c3/0x580
    [ 1307.341041] Modules linked in: ext4 mbcache jbd2 rfkill intel_rapl_msr intel_rapl_common isst_if_commd
    [ 1307.419445] CPU: 36 PID: 3224 Comm: jbd2/dm-2 Tainted: G W I 5.8.0-rc5 #2
    [ 1307.427359] Hardware name: Dell Inc. PowerEdge R740/01KPX8, BIOS 2.3.10 08/15/2019
    [ 1307.434932] RIP: 0010:generic_make_request_checks+0x2c3/0x580
    [ 1307.440676] Code: 94 03 00 00 48 89 df 48 8d 74 24 08 c6 05 cf 2b 18 01 01 e8 7f a4 ff ff 48 c7 c7 50e
    [ 1307.459420] RSP: 0018:ffffc0d70eb5fb48 EFLAGS: 00010286
    [ 1307.464646] RAX: 0000000000000000 RBX: ffff9b33b2978300 RCX: 0000000000000000
    [ 1307.471780] RDX: ffff9b33e12a81e0 RSI: ffff9b33e1298000 RDI: ffff9b33e1298000
    [ 1307.478913] RBP: ffff9b7b9679e0c0 R08: 0000000000000837 R09: 0000000000000024
    [ 1307.486044] R10: 0000000000000000 R11: ffffc0d70eb5f9f0 R12: 0000000000000400
    [ 1307.493177] R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
    [ 1307.500308] FS: 0000000000000000(0000) GS:ffff9b33e1280000(0000) knlGS:0000000000000000
    [ 1307.508396] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1307.514142] CR2: 000055eaf4109000 CR3: 0000003dee40a006 CR4: 00000000007606e0
    [ 1307.521273] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 1307.528407] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 1307.535538] PKRU: 55555554
    [ 1307.538250] Call Trace:
    [ 1307.540708] generic_make_request+0x30/0x340
    [ 1307.544985] submit_bio+0x43/0x190
    [ 1307.548393] ? bio_add_page+0x62/0x90
    [ 1307.552068] submit_bh_wbc+0x16a/0x190
    [ 1307.555833] jbd2_write_superblock+0xec/0x200 [jbd2]
    [ 1307.560803] jbd2_journal_update_sb_log_tail+0x65/0xc0 [jbd2]
    [ 1307.566557] jbd2_journal_commit_transaction+0x2ae/0x1860 [jbd2]
    [ 1307.572566] ? check_preempt_curr+0x7a/0x90
    [ 1307.576756] ? update_curr+0xe1/0x1d0
    [ 1307.580421] ? account_entity_dequeue+0x7b/0xb0
    [ 1307.584955] ? newidle_balance+0x231/0x3d0
    [ 1307.589056] ? __switch_to_asm+0x42/0x70
    [ 1307.592986] ? __switch_to_asm+0x36/0x70
    [ 1307.596918] ? lock_timer_base+0x67/0x80
    [ 1307.600851] kjournald2+0xbd/0x270 [jbd2]
    [ 1307.604873] ? finish_wait+0x80/0x80
    [ 1307.608460] ? commit_timeout+0x10/0x10 [jbd2]
    [ 1307.612915] kthread+0x114/0x130
    [ 1307.616152] ? kthread_park+0x80/0x80
    [ 1307.619816] ret_from_fork+0x22/0x30
    [ 1307.623400] ---[ end trace 27490236265b1630 ]---

    Signed-off-by: Lukas Czerner
    Reviewed-by: Andreas Dilger
    Link: https://lore.kernel.org/r/20200717090605.2612-1-lczerner@redhat.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Lukas Czerner
     
  • [ Upstream commit 11215630aada28307ba555a43138db6ac54fa825 ]

    A customer has reported a BUG_ON in ext4_clear_journal_err() being hit
    during LTP testing. Either this has been caused by a test setup
    issue where the filesystem was being overwritten while LTP was mounting
    it or the journal replay has overwritten the superblock with invalid
    data. In either case it is preferable we don't take the machine down
    with a BUG_ON. So handle the situation of unexpectedly missing
    has_journal feature more gracefully. We issue a warning and fail the mount
    in the cases where the race window is narrow and the failed check is
    most likely a programming error. In cases where fs corruption is more
    likely, we do full ext4_error() handling before failing mount / remount.

    Reviewed-by: Lukas Czerner
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20200710140759.18031-1-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Jan Kara
     

26 Aug, 2020

3 commits

  • [ Upstream commit bf9a379d0980e7413d94cb18dac73db2bfc5f470 ]

    Currently, add_system_zone() just silently merges two added system zones
    that overlap. However the overlap should not happen and it generally
    suggests that some unrelated metadata overlap which indicates the fs is
    corrupted. We should have caught such problems earlier (e.g. in
    ext4_check_descriptors()) but add this check as another line of defense.
    In a later patch we also use this for stricter checking of journal inode
    extent tree.

    Reviewed-by: Lukas Czerner
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20200728130437.7804-3-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Jan Kara
     
  • [ Upstream commit 5872331b3d91820e14716632ebb56b1399b34fe1 ]

    If for any reason a directory passed to do_split() does not have enough
    active entries to exceed half the size of the block, we can end up
    iterating over all "count" entries without finding a split point.

    In this case, count == move, and split will be zero, and we will
    attempt a negative index into map[].

    Guard against this by detecting this case, and falling back to
    split-to-half-of-count instead; in this case we will still have
    plenty of space (> half blocksize) in each split block.
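
    A simplified userspace model of that split-point selection is
    sketched below. It is not the do_split() code: toy_choose_split() and
    the plain sizes[] array (standing in for map[]) are assumptions made
    for this example.

    ```c
    #include <assert.h>

    /*
     * sizes[i] is the size in bytes of the i-th entry, ordered as in map[].
     * Returns the index at which the block would be split.
     */
    int toy_choose_split(const unsigned int *sizes, int count,
                         unsigned int blocksize)
    {
        unsigned int size = 0;
        int move = 0;
        int i;

        assert(count > 0);

        /* Walk from the end until about half the block is claimed. */
        for (i = count - 1; i >= 0; i--) {
            if (size + sizes[i] / 2 > blocksize / 2)
                break;
            size += sizes[i];
            move++;
        }

        /* If the loop ran off the front, count == move and count - move
         * would be 0; fall back to splitting by entry count instead. */
        if (i > 0)
            return count - move;
        return count / 2;
    }
    ```

    With four 10-byte entries in a 4096-byte block the loop never breaks,
    so the fallback returns 2 rather than a split index of 0.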

    Fixes: ef2b02d3e617 ("ext34: ensure do_split leaves enough free space in both blocks")
    Signed-off-by: Eric Sandeen
    Reviewed-by: Andreas Dilger
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/f53e246b-647c-64bb-16ec-135383c70ad7@redhat.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Eric Sandeen
     
  • commit 7303cb5bfe845f7d43cd9b2dbd37dbb266efda9b upstream.

    ext4_search_dir() and ext4_generic_delete_entry() can be called both for
    standard directory blocks and for inline directories stored inside inode
    or inline xattr space. For the second case we didn't call
    ext4_check_dir_entry() with proper constraints, which could result in
    accepting a corrupted directory entry as well as false positive filesystem
    errors like:

    EXT4-fs error (device dm-0): ext4_search_dir:1395: inode #28320400:
    block 113246792: comm dockerd: bad entry in directory: directory entry too
    close to block end - offset=0, inode=28320403, rec_len=32, name_len=8,
    size=4096

    Fix the arguments passed to ext4_check_dir_entry().

    Fixes: 109ba779d6cc ("ext4: check for directory entries too close to block end")
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20200731162135.8080-1-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

07 Aug, 2020

1 commit

  • This patch is used to fix ext4 direct I/O read error when
    the read size is not aligned with block size.

    The following test demonstrates the error.

    (1) Make a file that is not aligned with block size:
    $dd if=/dev/zero of=./test.jar bs=1000 count=3

    (2) I wrote a source file named "direct_io_read_file.c" as follows:

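    #include <fcntl.h>      /* open(), O_DIRECT (with -D_GNU_SOURCE) */
    #include <stdio.h>      /* printf(), perror() */
    #include <stdlib.h>     /* posix_memalign(), exit(), free() */
    #include <string.h>
    #include <sys/stat.h>
    #include <sys/types.h>
    #include <unistd.h>     /* read(), close() */
    #define BUF_SIZE 1024

    int main()
    {
        int fd;
        int ret;

        /* O_DIRECT requires a suitably aligned buffer. */
        unsigned char *buf;
        ret = posix_memalign((void **)&buf, 512, BUF_SIZE);
        if (ret) {
            perror("posix_memalign failed");
            exit(1);
        }
        fd = open("./test.jar", O_RDONLY | O_DIRECT, 0755);
        if (fd < 0){
            perror("open ./test.jar failed");
            exit(1);
        }

        /* Read until EOF (ret == 0) or an error (ret < 0). */
        do {
            ret = read(fd, buf, BUF_SIZE);
            printf("ret=%d\n",ret);
            if (ret < 0) {
                perror("write test.jar failed");
            }
        } while (ret > 0);

        free(buf);
        close(fd);
    }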

    (3) Compile the source file:
    $gcc direct_io_read_file.c -D_GNU_SOURCE

    (4) Run the test program:
    $./a.out

    The result is as follows:
    ret=1024
    ret=1024
    ret=952
    ret=-1
    write test.jar failed: Invalid argument.

    I have tested this program on an XFS filesystem. XFS does not have
    this problem, because XFS uses iomap_dio_rw() to do direct I/O
    reads. The comparison between the read offset and the file size is done
    in iomap_dio_rw(); the code is as follows:

    if (pos < size) {
    retval = filemap_write_and_wait_range(mapping, pos,
    pos + iov_length(iov, nr_segs) - 1);

    if (!retval) {
    retval = mapping->a_ops->direct_IO(READ, iocb,
    iov, pos, nr_segs);
    }
    ...
    }

    ...only when "pos < size" can direct I/O be done; otherwise 0 is returned.
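
    The effect of that check on the test above can be modelled in
    userspace as follows. This is a sketch under stated assumptions, not
    kernel code: toy_direct_read() and its align parameter are invented
    for illustration, and the alignment test is a stand-in for whatever
    validation rejects the unaligned offset.

    ```c
    #include <errno.h>
    #include <stddef.h>

    /*
     * Model of a direct I/O read of `count` bytes at offset `pos` from a
     * file of `i_size` bytes, with `align` as the required offset alignment.
     * Returns bytes read, 0 at end of file, or a negative errno.
     */
    long toy_direct_read(long pos, long i_size, size_t count, long align)
    {
        long left;

        /* The check under discussion: at or past EOF, return 0 before
         * any alignment validation can reject the unaligned offset. */
        if (pos >= i_size)
            return 0;

        if (pos % align != 0)
            return -EINVAL;

        left = i_size - pos;
        return (long)count < left ? (long)count : left;
    }
    ```

    For the 3000-byte file, reads at offsets 0, 1024 and 2048 return 1024,
    1024 and 952 bytes. The fourth read starts at offset 3000, which is
    not a multiple of 512; without the first check it would fall through
    to the alignment test and fail with EINVAL instead of reporting EOF.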

    I have tested the fix patch on Ext4; the behavior now conforms to the
    description of EINVAL in read(2):
    #include <unistd.h>
    ssize_t read(int fd, void *buf, size_t count);

    EINVAL
    fd is attached to an object which is unsuitable for reading;
    or the file was opened with the O_DIRECT flag, and either the
    address specified in buf, the value specified in count, or the
    current file offset is not suitably aligned.

    So I think this patch can be applied to fix ext4 direct I/O error.

    However, Ext4 introduced direct I/O read using the iomap infrastructure
    in kernel 5.5; the patch is commit
    ("ext4: introduce direct I/O read using iomap infrastructure").
    Since then, Ext4 behaves the same as XFS: both use iomap_dio_rw() to do
    direct I/O reads. So this problem does not exist on kernel 5.5 for Ext4.

    From the above description, we can see that this problem exists on all
    kernel versions between kernel 3.14 and kernel 5.4. It causes applications
    to fail to read. For example, when the search service downloads a new full
    index file while the search engine is still loading the previous index file
    and processing search requests, it cannot use buffered I/O, which may evict
    the previous index file in use from the page cache, so the search service
    must use direct I/O read.

    Please apply this patch on these kernel versions, or please use the method
    on kernel 5.5 to fix this problem.

    Fixes: 9fe55eea7e4b ("Fix race when checking i_size on direct i/o read")
    Reviewed-by: Jan Kara
    Co-developed-by: Wang Long
    Signed-off-by: Wang Long
    Signed-off-by: Jiang Ying
    Signed-off-by: Greg Kroah-Hartman

    Jiang Ying
     

24 Jun, 2020

4 commits

  • [ Upstream commit 829b37b8cddb1db75c1b7905505b90e593b15db1 ]

    Trying to change dax mount options when remounting could allow mount
    options to be enabled for a small amount of time, and then the mount
    option change would be reverted.

    In the case of "mount -o remount,dax", this can cause a race where
    files would temporarily be treated as DAX --- and then not.

    Cc: stable@kernel.org
    Reported-by: syzbot+bca9799bf129256190da@syzkaller.appspotmail.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Theodore Ts'o
     
  • commit 2ce3ee931a097e9720310db3f09c01c825a4580c upstream.

    If the dentry name passed to ->d_compare() fits in dentry::d_iname, then
    it may be concurrently modified by a rename. This can cause undefined
    behavior (possibly out-of-bounds memory accesses or crashes) in
    utf8_strncasecmp(), since fs/unicode/ isn't written to handle strings
    that may be concurrently modified.

    Fix this by first copying the filename to a stack buffer if needed.
    This way we get a stable snapshot of the filename.
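
    The copy-then-compare shape of that fix can be sketched in userspace
    as below. It only illustrates the ordering of the steps and is not a
    thread-safe implementation: toy_compare_name() and TOY_INLINE_LEN are
    assumptions for this example, and strncasecmp() stands in for the
    Unicode-aware comparison.

    ```c
    #include <string.h>
    #include <strings.h>

    #define TOY_INLINE_LEN 32   /* hypothetical inline-name capacity */

    /*
     * Compare `len` bytes of a possibly shared `name` against `wanted`,
     * ignoring ASCII case. Returns 0 on match, nonzero otherwise.
     */
    int toy_compare_name(const char *name, size_t len, const char *wanted)
    {
        char snapshot[TOY_INLINE_LEN];

        if (len < sizeof(snapshot)) {
            /* Short names may live in a shared inline buffer:
             * copy first, then look only at the private copy. */
            memcpy(snapshot, name, len);
            snapshot[len] = '\0';
            name = snapshot;
        }
        if (strlen(wanted) != len)
            return 1;
        return strncasecmp(name, wanted, len);
    }
    ```

    Longer names skip the copy in this sketch, mirroring the "if needed"
    in the description: only names short enough to be stored inline are
    snapshotted.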

    Fixes: b886ee3e778e ("ext4: Support case-insensitive file name lookups")
    Cc: # v5.2+
    Cc: Al Viro
    Cc: Daniel Rosenberg
    Cc: Gabriel Krisman Bertazi
    Signed-off-by: Eric Biggers
    Reviewed-by: Andreas Dilger
    Link: https://lore.kernel.org/r/20200601200543.59417-1-ebiggers@kernel.org
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit cfb3c85a600c6aa25a2581b3c1c4db3460f14e46 upstream.

    Fix the bug when calculating the physical block number of the first
    block in the split extent.

    This bug will cause xfstests shared/298 failure on ext4 with bigalloc
    enabled occasionally. Ext4 error messages indicate that previously freed
    blocks are being freed again, and the following fsck will fail due to
    the inconsistency of block bitmap and bg descriptor.

    The following is an example case:

    1. First, initialize an ext4 filesystem with cluster size '16K', block size
    '4K', in which case, one cluster contains four blocks.

    2. Create one file (e.g., xxx.img) on this ext4 filesystem. Now the extent
    tree of this file is like:

    ...
    36864:[0]4:220160
    36868:[0]14332:145408
    51200:[0]2:231424
    ...

    3. Then execute PUNCH_HOLE fallocate on this file. The hole range is
    like:

    ..
    ext4_ext_remove_space: dev 254,16 ino 12 since 49506 end 49506 depth 1
    ext4_ext_remove_space: dev 254,16 ino 12 since 49544 end 49546 depth 1
    ext4_ext_remove_space: dev 254,16 ino 12 since 49605 end 49607 depth 1
    ...

    4. Then the extent tree of this file after punching is like

    ...
    49507:[0]37:158047
    49547:[0]58:158087
    ...

    5. Detailed procedure of punching hole [49544, 49546]

    5.1. The block address space:
    ```
    lblk        ~49505  49506   49507~49543     49544~49546    49547~
             ---------+------+-------------+----------------+--------
               extent | hole |   extent    |      hole      | extent
             ---------+------+-------------+----------------+--------
    pblk       ~158045  158046  158047~158083  158084~158086   158087~
    ```

    5.2. The detailed layout of cluster 39521:
    ```
    cluster 39521

    hole extent
    ```

    Reviewed-by: Eric Whitney
    Cc: stable@kernel.org # v3.19+
    Link: https://lore.kernel.org/r/1590121124-37096-1-git-send-email-jefflexu@linux.alibaba.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jeffle Xu
     
  • [ Upstream commit 5adaccac46ea79008d7b75f47913f1a00f91d0ce ]

    Currently the error code from ext4_commit_super() overwrites the EROFS
    set in ext4_setup_super(). There is actually no need to call
    ext4_commit_super() since we will return EROFS anyway. Fix it by jumping
    to done directly.
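
    The control flow being described can be sketched as follows. This is
    a minimal userspace model, not the ext4 function: toy_setup_super(),
    toy_commit_super() and the must_stay_read_only parameter are
    hypothetical names for illustration.

    ```c
    #include <errno.h>

    /* Hypothetical stand-in for writing the superblock; always succeeds. */
    static int toy_commit_super(void)
    {
        return 0;
    }

    /* Returns 0 on success or a negative errno. */
    int toy_setup_super(int must_stay_read_only)
    {
        int err = 0;

        if (must_stay_read_only) {
            err = -EROFS;
            goto done;      /* skip the commit so err is not overwritten */
        }
        err = toy_commit_super();
    done:
        return err;
    }
    ```

    Without the goto, the assignment from toy_commit_super() would replace
    -EROFS with 0 and the caller would never see the read-only condition.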

    Fixes: c89128a00838 ("ext4: handle errors on ext4_commit_super")
    Signed-off-by: yangerkun
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20200601073404.3712492-1-yangerkun@huawei.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    yangerkun
     

22 Jun, 2020

3 commits

  • commit 08adf452e628b0e2ce9a01048cfbec52353703d7 upstream.

    'igrab(d_inode(dentry->d_parent))' without holding dentry->d_lock is
    broken because without d_lock, d_parent can be concurrently changed due
    to a rename(). Then if the old directory is immediately deleted, old
    d_parent->d_inode can be NULL. That causes a NULL dereference in igrab().

    To fix this, use dget_parent() to safely grab a reference to the parent
    dentry, which pins the inode. This also eliminates the need to use
    d_find_any_alias() other than for the initial inode, as we no longer
    throw away the dentry at each step.

    This is an extremely hard race to hit, but it is possible. Adding a
    udelay() in between the reads of ->d_parent and its ->d_inode makes it
    reproducible on a no-journal filesystem using the following program:

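    #include <fcntl.h>      /* open() */
    #include <stdio.h>      /* rename() */
    #include <sys/stat.h>   /* mkdir() */
    #include <unistd.h>     /* fork(), write(), close(), rmdir() */

    int main()
    {
        if (fork()) {
            /* Parent: keep recreating dir1/file with synchronous writes. */
            for (;;) {
                mkdir("dir1", 0700);
                /* O_CREAT requires an explicit mode argument. */
                int fd = open("dir1/file", O_RDWR|O_CREAT|O_SYNC, 0600);
                write(fd, "X", 1);
                close(fd);
            }
        } else {
            /* Child: keep moving the file away and removing its old parent. */
            mkdir("dir2", 0700);
            for (;;) {
                rename("dir1/file", "dir2/file");
                rmdir("dir1");
            }
        }
    }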

    Fixes: d59729f4e794 ("ext4: fix races in ext4_sync_parent()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Eric Biggers
    Link: https://lore.kernel.org/r/20200506183140.541194-1-ebiggers@kernel.org
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 8418897f1bf87da0cb6936489d57a4320c32c0af upstream.

    Don't pass error pointers to brelse().

    Commit 7159a986b420 ("ext4: fix some error pointer dereferences") fixed
    some cases; fix the one remaining case.

    If ext4_xattr_block_find()->ext4_sb_bread() fails, an error pointer is
    stored in @bs->bh, which will be passed to brelse() in the cleanup
    routine of ext4_xattr_set_handle(). This will then cause a NULL pointer
    dereference crash in __brelse().

    BUG: unable to handle kernel NULL pointer dereference at 000000000000005b
    RIP: 0010:__brelse+0x1b/0x50
    Call Trace:
    ext4_xattr_set_handle+0x163/0x5d0
    ext4_xattr_set+0x95/0x110
    __vfs_setxattr+0x6b/0x80
    __vfs_setxattr_noperm+0x68/0x1b0
    vfs_setxattr+0xa0/0xb0
    setxattr+0x12c/0x1a0
    path_setxattr+0x8d/0xc0
    __x64_sys_setxattr+0x27/0x30
    do_syscall_64+0x60/0x250
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    In this case, @bs->bh stores '-EIO' actually.
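
    To see why an error pointer looks like a near-NULL address, the
    sketch below imitates the error-pointer convention in userspace. The
    toy_* helpers and struct toy_search are assumptions for this example,
    and the 0x60 field offset used in the usage note is chosen for
    illustration only.

    ```c
    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>

    #define TOY_MAX_ERRNO 4095

    /* Userspace imitations of the kernel's error-pointer helpers. */
    static inline void *toy_err_ptr(long err) { return (void *)(intptr_t)err; }
    static inline long toy_ptr_err(const void *p) { return (long)(intptr_t)p; }
    static inline int toy_is_err(const void *p)
    {
        return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
    }

    /* Hypothetical search state holding a buffer pointer. */
    struct toy_search { void *bh; };

    /* Turn an error pointer into an error code and clear the slot, so
     * later cleanup code never hands it to a release function. */
    long toy_check_bh(struct toy_search *bs)
    {
        if (toy_is_err(bs->bh)) {
            long err = toy_ptr_err(bs->bh);
            bs->bh = NULL;
            return err;
        }
        return 0;
    }

    /* Address touched when a field at `offset` is read through `p`. */
    uintptr_t toy_fault_address(const void *p, uintptr_t offset)
    {
        return (uintptr_t)p + offset;
    }
    ```

    With bh holding toy_err_ptr(-EIO), i.e. the value -5, reading a field
    at offset 0x60 wraps around to address 0x5b, which matches the
    faulting address in the report above under that offset assumption.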

    Fixes: fb265c9cb49e ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases")
    Signed-off-by: Jeffle Xu
    Reviewed-by: Joseph Qi
    Cc: stable@kernel.org # 2.6.19
    Reviewed-by: Ritesh Harjani
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/1587628004-95123-1-git-send-email-jefflexu@linux.alibaba.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jeffle Xu
     
  • commit c36a71b4e35ab35340facdd6964a00956b9fef0a upstream.

    If eh->eh_max is 0, EXT_MAX_EXTENT/INDEX would evaluate to unsigned
    (-1) resulting in illegal memory accesses. Although there is no
    consistent repro, we see that generic/019 sometimes crashes because of
    this bug.
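
    The wrap-around can be shown with plain index arithmetic. This is an
    illustrative sketch, not the kernel macros: both toy_* functions are
    hypothetical and work on slot indices rather than extent pointers.

    ```c
    #include <limits.h>
    #include <stdint.h>

    /* Unguarded: index of the last slot computed as eh_max - 1 in
     * unsigned arithmetic, which wraps around when eh_max is 0. */
    unsigned int toy_last_index_unguarded(uint16_t eh_max)
    {
        return (unsigned int)eh_max - 1u;
    }

    /* Guarded: report "no valid slot" (-1) instead of wrapping. */
    long toy_last_index_guarded(uint16_t eh_max)
    {
        return eh_max ? (long)eh_max - 1 : -1;
    }
    ```

    For eh_max == 0 the unguarded version yields UINT_MAX, so any code
    that treats the result as an upper bound walks far outside the array.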

    Ran gce-xfstests smoke and verified that there were no regressions.

    Signed-off-by: Harshad Shirwadkar
    Link: https://lore.kernel.org/r/20200421023959.20879-2-harshadshirwadkar@gmail.com
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Harshad Shirwadkar
     

02 May, 2020

4 commits

  • commit f1eec3b0d0a849996ebee733b053efa71803dad5 upstream.

    While calculating overhead for the internal journal, also check
    that j_inum is not 0. Otherwise we get the error below with
    xfstests generic/050 with external journal (XXX_LOGDEV config) enabled.

    It can be reproduced simply by using a loop device with an external
    journal and marking the blockdev as RO before mounting.

    [ 3337.146838] EXT4-fs error (device pmem1p2): ext4_get_journal_inode:4634: comm mount: inode #0: comm mount: iget: illegal inode #
    ------------[ cut here ]------------
    generic_make_request: Trying to write to read-only block-device pmem1p2 (partno 2)
    WARNING: CPU: 107 PID: 115347 at block/blk-core.c:788 generic_make_request_checks+0x6b4/0x7d0
    CPU: 107 PID: 115347 Comm: mount Tainted: G L --------- -t - 4.18.0-167.el8.ppc64le #1
    NIP: c0000000006f6d44 LR: c0000000006f6d40 CTR: 0000000030041dd4

    NIP [c0000000006f6d44] generic_make_request_checks+0x6b4/0x7d0
    LR [c0000000006f6d40] generic_make_request_checks+0x6b0/0x7d0

    Call Trace:
    generic_make_request_checks+0x6b0/0x7d0 (unreliable)
    generic_make_request+0x3c/0x420
    submit_bio+0xd8/0x200
    submit_bh_wbc+0x1e8/0x250
    __sync_dirty_buffer+0xd0/0x210
    ext4_commit_super+0x310/0x420 [ext4]
    __ext4_error+0xa4/0x1e0 [ext4]
    __ext4_iget+0x388/0xe10 [ext4]
    ext4_get_journal_inode+0x40/0x150 [ext4]
    ext4_calculate_overhead+0x5a8/0x610 [ext4]
    ext4_fill_super+0x3188/0x3260 [ext4]
    mount_bdev+0x778/0x8f0
    ext4_mount+0x28/0x50 [ext4]
    mount_fs+0x74/0x230
    vfs_kern_mount.part.6+0x6c/0x250
    do_mount+0x2fc/0x1280
    sys_mount+0x158/0x180
    system_call+0x5c/0x70
    EXT4-fs (pmem1p2): no journal found
    EXT4-fs (pmem1p2): can't get journal size
    EXT4-fs (pmem1p2): mounted filesystem without journal. Opts: dax,norecovery

    Fixes: 3c816ded78bb ("ext4: use journal inode to determine journal overhead")
    Reported-by: Harish Sriram
    Signed-off-by: Ritesh Harjani
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20200316093038.25485-1-riteshh@linux.ibm.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Ritesh Harjani
     
  • [ Upstream commit 907ea529fc4c3296701d2bfc8b831dd2a8121a34 ]

    If the in-core buddy bitmap gets corrupted (or out of sync with the
    block bitmap), issue a WARN_ON and try to recover. In most cases this
    involves skipping trying to allocate out of a particular block group.
    We can end up declaring the file system corrupted, which is fair,
    since the file system probably should be checked before we proceed any
    further.

    Link: https://lore.kernel.org/r/20200414035649.293164-1-tytso@mit.edu
    Google-Bug-Id: 34811296
    Google-Bug-Id: 34639169
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Theodore Ts'o
     
  • [ Upstream commit a17a9d935dc4a50acefaf319d58030f1da7f115a ]

    Current wait times have proven to be too short to protect against inode
    reuses that lead to metadata inconsistencies.

    Now that we will retry the inode allocation if we can't find any
    recently deleted inodes, it's a lot safer to increase the recently
    deleted time from 5 seconds to a minute.
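
    A recency check of this kind might look like the sketch below. It is
    a simplified model, not the ext4 implementation: the function name,
    the TOY_* constants and the use of plain seconds are assumptions for
    illustration.

    ```c
    #include <stdint.h>

    #define TOY_RECENCY_OLD 5    /* seconds, previous window */
    #define TOY_RECENCY_NEW 60   /* seconds, window after this change */

    /* Returns 1 if an inode deleted at `dtime` is still too fresh to
     * reuse at time `now`, given a window of `recency` seconds.
     * dtime == 0 means the inode was never deleted. */
    int toy_recently_deleted(uint32_t dtime, uint32_t now, uint32_t recency)
    {
        if (dtime == 0 || dtime > now)
            return 0;
        return now - dtime < recency;
    }
    ```

    An inode deleted 30 seconds ago is reusable under the 5-second window
    but is still skipped under the 60-second one.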

    Link: https://lore.kernel.org/r/20200414023925.273867-1-tytso@mit.edu
    Google-Bug-Id: 36602237
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Theodore Ts'o
     
  • [ Upstream commit c2a559bc0e7ed5a715ad6b947025b33cb7c05ea7 ]

    Running generic/388 in journal data mode may sometimes trigger the warning
    in ext4_invalidatepage. We should use the matching invalidatepage
    in ext4_writepage.

    Signed-off-by: yangerkun
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Ritesh Harjani
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20200226041002.13914-1-yangerkun@huawei.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    yangerkun
     

29 Apr, 2020

1 commit

  • [ Upstream commit 4068664e3cd2312610ceac05b74c4cf1853b8325 ]

    Extents are cached in read_extent_tree_block(); as a result, extents
    are not cached for inodes with depth == 0 when we try to find the
    extent using ext4_find_extent(). The result of the lookup is cached
    in ext4_map_blocks() but is only a subset of the extent on disk. As a
    result, the contents of extents status cache can get very badly
    fragmented for certain workloads, such as a random 4k read workload.

    File size of /mnt/test is 33554432 (8192 blocks of 4096 bytes)
    ext: logical_offset: physical_offset: length: expected: flags:
    0: 0.. 8191: 40960.. 49151: 8192: last,eof

    $ perf record -e 'ext4:ext4_es_*' /root/bin/fio --name=t --direct=0 --rw=randread --bs=4k --filesize=32M --size=32M --filename=/mnt/test
    $ perf script | grep ext4_es_insert_extent | head -n 10
    fio 131 [000] 13.975421: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [494/1) mapped 41454 status W
    fio 131 [000] 13.975939: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6064/1) mapped 47024 status W
    fio 131 [000] 13.976467: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6907/1) mapped 47867 status W
    fio 131 [000] 13.976937: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3850/1) mapped 44810 status W
    fio 131 [000] 13.977440: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3292/1) mapped 44252 status W
    fio 131 [000] 13.977931: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [6882/1) mapped 47842 status W
    fio 131 [000] 13.978376: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [3117/1) mapped 44077 status W
    fio 131 [000] 13.978957: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [2896/1) mapped 43856 status W
    fio 131 [000] 13.979474: ext4:ext4_es_insert_extent: dev 253,0 ino 12 es [7479/1) mapped 48439 status W

    Fix this by caching the extents for inodes with depth == 0 in
    ext4_find_extent().

    [ Renamed ext4_es_cache_extents() to ext4_cache_extents() since this
    newly added function is not in extents_cache.c, and to avoid
    potential visual confusion with ext4_es_cache_extent(). -TYT ]

    Signed-off-by: Dmitry Monakhov
    Link: https://lore.kernel.org/r/20191106122502.19986-1-dmonakhov@gmail.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Dmitry Monakhov
     

23 Apr, 2020

2 commits

  • [ Upstream commit c96e2b8564adfb8ac14469ebc51ddc1bfecb3ae2 ]

    Under some circumstances we may encounter a filesystem error on a
    read-only block device, and if we try to save the error info to the
    superblock and commit it, we'll wind up with a noisy error and
    backtrace, i.e.:

    [ 3337.146838] EXT4-fs error (device pmem1p2): ext4_get_journal_inode:4634: comm mount: inode #0: comm mount: iget: illegal inode #
    ------------[ cut here ]------------
    generic_make_request: Trying to write to read-only block-device pmem1p2 (partno 2)
    WARNING: CPU: 107 PID: 115347 at block/blk-core.c:788 generic_make_request_checks+0x6b4/0x7d0
    ...

    To avoid this, commit the error info in the superblock only if the
    block device is writable.
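
    The guard amounts to a writability check before the superblock write.
    A minimal userspace sketch, assuming a hypothetical struct toy_bdev
    and toy_save_error_info() in place of the real block-layer and ext4
    interfaces:

    ```c
    #include <stdbool.h>

    /* Hypothetical device state for the example. */
    struct toy_bdev {
        bool read_only;
        int writes;         /* number of superblock writes issued */
    };

    /* Write the error details out only when the device accepts writes;
     * returns true if a write was issued. */
    bool toy_save_error_info(struct toy_bdev *bdev)
    {
        if (bdev->read_only)
            return false;   /* avoid the write-to-read-only warning */
        bdev->writes++;
        return true;
    }
    ```

    On a read-only device the error is still reported to the log in this
    model; only the attempt to persist it is skipped.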

    Reported-by: Ritesh Harjani
    Signed-off-by: Eric Sandeen
    Reviewed-by: Andreas Dilger
    Link: https://lore.kernel.org/r/4b6e774d-cc00-3469-7abb-108eb151071a@sandeen.net
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Eric Sandeen
     
  • commit d87f639258a6a5980183f11876c884931ad93da2 upstream.

    Since commit a8ac900b8163 ("ext4: use non-movable memory for the
    superblock"), buffers for the ext4 superblock have been allocated
    using the sb_bread_unmovable() helper, which allocates buffer heads
    out of non-movable memory blocks. This was necessary to avoid
    blocking page migrations and causing CMA allocation failures.

    However commit 85c8f176a611 ("ext4: preload block group descriptors")
    broke this by introducing pre-reading of the ext4 superblock.
    The problem is that __breadahead() is using __getblk() underneath,
    which allocates buffer heads out of movable memory.

    This resulted in page migration failures that I have seen on a
    machine with an ext4 partition and a preallocated CMA area.

    Fix this by introducing sb_breadahead_unmovable() and
    __breadahead_gfp() helpers which use non-movable memory for buffer
    head allocations and use them for the ext4 superblock readahead.

    Reviewed-by: Andreas Dilger
    Fixes: 85c8f176a611 ("ext4: preload block group descriptors")
    Signed-off-by: Roman Gushchin
    Link: https://lore.kernel.org/r/20200229001411.128010-1-guro@fb.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

21 Apr, 2020

3 commits

  • commit 801674f34ecfed033b062a0f217506b93c8d5e8a upstream.

    We do not want to create initialized extents beyond end of file because
    for e2fsck it is impossible to distinguish them from a case of corrupted
    file size / extent tree and so it complains like:

    Inode 12, i_size is 147456, should be 163840. Fix? no

    The code in ext4_ext_convert_to_initialized() and
    ext4_split_convert_extents() tries to make sure it does not create
    initialized extents beyond the inode size; however, it checks against
    inode->i_size, which is wrong. It should instead check against
    EXT4_I(inode)->i_disksize, which is the current inode size on disk.
    That is what e2fsck is going to see in case of a crash before all dirty
    data is written. This bug manifests as a generic/456 test failure (with
    recent enough fstests, where fsx got fixed to properly pass the
    FALLOC_FL_KEEP_SIZE flag to the kernel) when run with the dioread_lock
    mount option.
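
The two sizes in the e2fsck message above can be turned into block numbers to see how far apart the two limits are. This is only an illustrative sketch: the 4 KiB block size (12 bits), the round-up computation, and the helper name `eof_block` are assumptions, not details taken from the commit message.

```c
#include <stdint.h>

/*
 * Round a byte size up to whole blocks, giving the index of the first
 * block that lies entirely past end of file.
 * Hypothetical helper; blocksize_bits = 12 means 4096-byte blocks.
 */
uint64_t eof_block(uint64_t size_bytes, unsigned int blocksize_bits)
{
    uint64_t blocksize = (uint64_t)1 << blocksize_bits;

    return (size_bytes + blocksize - 1) >> blocksize_bits;
}
```

Under these assumptions, a limit derived from 163840 bytes allows initialized extents up to block 40, while a limit derived from the 147456 bytes recorded on disk stops at block 36. Those four extra blocks are the kind of discrepancy e2fsck reports.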

    CC: stable@vger.kernel.org
    Fixes: 21ca087a3891 ("ext4: Do not zero out uninitialized extents beyond i_size")
    Reviewed-by: Lukas Czerner
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Link: https://lore.kernel.org/r/20200331105016.8674-1-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit b9c538da4e52a7b79dfcf4cfa487c46125066dfb upstream.

    If ext4_fill_super detects an invalid number of inodes per group, the
    resulting error message printed the number of blocks per group, rather
    than the number of inodes per group. Fix it to print the correct value.

    Fixes: cd6bb35bf7f6d ("ext4: use more strict checks for inodes_per_block on mount")
    Link: https://lore.kernel.org/r/8be03355983a08e5d4eed480944613454d7e2550.1585434649.git.josh@joshtriplett.org
    Reviewed-by: Andreas Dilger
    Signed-off-by: Josh Triplett
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Josh Triplett
     
  • commit df41460a21b06a76437af040d90ccee03888e8e5 upstream.

    ext4_fill_super double-checks the number of groups before mounting; if
    that check fails, the resulting error message prints the group count
    from the ext4_sb_info sbi, which hasn't been set yet. Print the freshly
    computed group count instead (which at that point has just been computed
    in "blocks_count").

    Signed-off-by: Josh Triplett
    Fixes: 4ec1102813798 ("ext4: Add sanity checks for the superblock before mounting the filesystem")
    Link: https://lore.kernel.org/r/8b957cd1513fcc4550fe675c10bcce2175c33a49.1585431964.git.josh@joshtriplett.org
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Josh Triplett
     

17 Apr, 2020

1 commit

  • commit 28936b62e71e41600bab319f262ea9f9b1027629 upstream.

    inode->i_blocks could be accessed concurrently as noticed by KCSAN,

    BUG: KCSAN: data-race in ext4_do_update_inode [ext4] / inode_add_bytes

    write to 0xffff9a00d4b982d0 of 8 bytes by task 22100 on cpu 118:
    inode_add_bytes+0x65/0xf0
    __inode_add_bytes at fs/stat.c:689
    (inlined by) inode_add_bytes at fs/stat.c:702
    ext4_mb_new_blocks+0x418/0xca0 [ext4]
    ext4_ext_map_blocks+0x1a6b/0x27b0 [ext4]
    ext4_map_blocks+0x1a9/0x950 [ext4]
    _ext4_get_block+0xfc/0x270 [ext4]
    ext4_get_block_unwritten+0x33/0x50 [ext4]
    __block_write_begin_int+0x22e/0xae0
    __block_write_begin+0x39/0x50
    ext4_write_begin+0x388/0xb50 [ext4]
    ext4_da_write_begin+0x35f/0x8f0 [ext4]
    generic_perform_write+0x15d/0x290
    ext4_buffered_write_iter+0x11f/0x210 [ext4]
    ext4_file_write_iter+0xce/0x9e0 [ext4]
    new_sync_write+0x29c/0x3b0
    __vfs_write+0x92/0xa0
    vfs_write+0x103/0x260
    ksys_write+0x9d/0x130
    __x64_sys_write+0x4c/0x60
    do_syscall_64+0x91/0xb05
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    read to 0xffff9a00d4b982d0 of 8 bytes by task 8 on cpu 65:
    ext4_do_update_inode+0x4a0/0xf60 [ext4]
    ext4_inode_blocks_set at fs/ext4/inode.c:4815
    ext4_mark_iloc_dirty+0xaf/0x160 [ext4]
    ext4_mark_inode_dirty+0x129/0x3e0 [ext4]
    ext4_convert_unwritten_extents+0x253/0x2d0 [ext4]
    ext4_convert_unwritten_io_end_vec+0xc5/0x150 [ext4]
    ext4_end_io_rsv_work+0x22c/0x350 [ext4]
    process_one_work+0x54f/0xb90
    worker_thread+0x80/0x5f0
    kthread+0x1cd/0x1f0
    ret_from_fork+0x27/0x50

    4 locks held by kworker/u256:0/8:
    #0: ffff9a025abc4328 ((wq_completion)ext4-rsv-conversion){+.+.}, at: process_one_work+0x443/0xb90
    #1: ffffab5a862dbe20 ((work_completion)(&ei->i_rsv_conversion_work)){+.+.}, at: process_one_work+0x443/0xb90
    #2: ffff9a025a9d0f58 (jbd2_handle){++++}, at: start_this_handle+0x1c1/0x9d0 [jbd2]
    #3: ffff9a00d4b985d8 (&(&ei->i_raw_lock)->rlock){+.+.}, at: ext4_do_update_inode+0xaa/0xf60 [ext4]
    irq event stamp: 3009267
    hardirqs last enabled at (3009267): [] __find_get_block+0x107/0x790
    hardirqs last disabled at (3009266): [] __find_get_block+0x49/0x790
    softirqs last enabled at (3009230): [] __do_softirq+0x34c/0x57c
    softirqs last disabled at (3009223): [] irq_exit+0xa2/0xc0

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 65 PID: 8 Comm: kworker/u256:0 Tainted: G L 5.6.0-rc2-next-20200221+ #7
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    Workqueue: ext4-rsv-conversion ext4_end_io_rsv_work [ext4]

    The plain read is outside of inode->i_lock critical section which
    results in a data race. Fix it by adding READ_ONCE() there.

    Link: https://lore.kernel.org/r/20200222043258.2279-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Qian Cai
     

05 Mar, 2020

1 commit

  • commit 37b0b6b8b99c0e1c1f11abbe7cf49b6d03795b3f upstream.

    If sbi->s_flex_groups_allocated is zero and the first allocation fails
    then this code will crash. The problem is that "i--" will set "i" to
    -1 but when we compare "i >= sbi->s_flex_groups_allocated" then the -1
    is type promoted to unsigned and becomes UINT_MAX. Since UINT_MAX
    is more than zero, the condition is true so we call kvfree(new_groups[-1]).
    The loop will carry on freeing invalid memory until it crashes.
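
The conversion described above can be reproduced in a few lines of standalone C. The function names are made up for illustration, and the "fixed" variant is just one possible way to avoid the conversion; it is not claimed to be the change that went upstream.

```c
#include <stdbool.h>

/*
 * Mirrors the buggy loop condition: a signed index compared against an
 * unsigned count. The usual arithmetic conversions turn i into
 * unsigned int, so i == -1 is compared as UINT_MAX.
 */
bool buggy_keep_freeing(int i, unsigned int allocated)
{
    return i >= allocated;
}

/*
 * One possible repair (an assumption, for illustration only): do the
 * comparison in a signed type so that -1 stays negative. This is only
 * safe while allocated fits in an int.
 */
bool fixed_keep_freeing(int i, unsigned int allocated)
{
    return i >= (int)allocated;
}
```

With `i == -1` and `allocated == 0`, the buggy variant reports that the loop should keep going, which corresponds to the `kvfree(new_groups[-1])` call in the description; the signed variant stops.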

    Fixes: 7c990728b99e ("ext4: fix potential race between s_flex_groups online resizing and access")
    Reviewed-by: Suraj Jitindar Singh
    Signed-off-by: Dan Carpenter
    Cc: stable@kernel.org
    Link: https://lore.kernel.org/r/20200228092142.7irbc44yaz3by7nb@kili.mountain
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Dan Carpenter
     

29 Feb, 2020

8 commits

  • commit cb85f4d23f794e24127f3e562cb3b54b0803f456 upstream.

    If EXT4_EXTENTS_FL is set on an inode while ext4_writepages() is running
    on it, the following warning in ext4_add_complete_io() can be hit:

    WARNING: CPU: 1 PID: 0 at fs/ext4/page-io.c:234 ext4_put_io_end_defer+0xf0/0x120

    Here's a minimal reproducer (not 100% reliable) (root isn't required):

    while true; do
    sync
    done &
    while true; do
    rm -f file
    touch file
    chattr -e file
    echo X >> file
    chattr +e file
    done

    The problem is that in ext4_writepages(), ext4_should_dioread_nolock()
    (which only returns true on extent-based files) is checked once to set
    the number of reserved journal credits, and also again later to select
    the flags for ext4_map_blocks() and copy the reserved journal handle to
    ext4_io_end::handle. But if EXT4_EXTENTS_FL is being concurrently set,
    the first check can see dioread_nolock disabled while the later one can
    see it enabled, causing the reserved handle to unexpectedly be NULL.

    Since changing EXT4_EXTENTS_FL is uncommon, and there may be other races
    related to doing so as well, fix this by synchronizing changing
    EXT4_EXTENTS_FL with ext4_writepages() via the existing
    s_writepages_rwsem (previously called s_journal_flag_rwsem).

    This was originally reported by syzbot without a reproducer at
    https://syzkaller.appspot.com/bug?extid=2202a584a00fffd19fbf,
    but now that dioread_nolock is the default I also started seeing this
    when running syzkaller locally.

    Link: https://lore.kernel.org/r/20200219183047.47417-3-ebiggers@kernel.org
    Reported-by: syzbot+2202a584a00fffd19fbf@syzkaller.appspotmail.com
    Fixes: 6b523df4fb5a ("ext4: use transaction reservation for extent conversion in ext4_end_io")
    Signed-off-by: Eric Biggers
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit bbd55937de8f2754adc5792b0f8e5ff7d9c0420e upstream.

    In preparation for making s_journal_flag_rwsem synchronize
    ext4_writepages() with changes to both the EXTENTS and JOURNAL_DATA
    flags (rather than just JOURNAL_DATA as it does currently), rename it to
    s_writepages_rwsem.

    Link: https://lore.kernel.org/r/20200219183047.47417-2-ebiggers@kernel.org
    Signed-off-by: Eric Biggers
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Eric Biggers
     
  • commit 9db176bceb5c5df4990486709da386edadc6bd1d upstream.

    When CONFIG_QFMT_V2 is configured as a module, the test in
    ext4_feature_set_ok() fails, and so mounting filesystems with the quota
    or project features fails. Fix the test to use the IS_ENABLED macro,
    which works properly even for modules.
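
The difference is easier to see outside the kernel. When an option is built as a module, Kconfig defines only the `_MODULE` variant of the symbol, so a check on the plain symbol misses it. The sketch below assumes the old test was a plain preprocessor check (the commit message does not show it), uses a made-up option name `CONFIG_DEMO_OPT`, and uses simplified `DEMO_*` macros modelled on the token-pasting trick behind `IS_ENABLED()` rather than the kernel's own header.

```c
/* Pretend Kconfig selected the option as a module (=m). */
#define CONFIG_DEMO_OPT_MODULE 1

/* Expands to 1 if the argument is a macro defined as 1, else to 0. */
#define DEMO_PLACEHOLDER_1 0,
#define DEMO_SECOND(ignored, val, ...) val
#define DEMO_IS_DEFINED(x) DEMO_IS_DEFINED_(x)
#define DEMO_IS_DEFINED_(val) DEMO_IS_DEFINED__(DEMO_PLACEHOLDER_##val)
#define DEMO_IS_DEFINED__(arg1_or_junk) DEMO_SECOND(arg1_or_junk 1, 0, 0)

#define DEMO_IS_BUILTIN(opt) DEMO_IS_DEFINED(opt)
#define DEMO_IS_MODULE(opt)  DEMO_IS_DEFINED(opt##_MODULE)
#define DEMO_IS_ENABLED(opt) (DEMO_IS_BUILTIN(opt) || DEMO_IS_MODULE(opt))

/* A plain #ifdef only sees the built-in (=y) form of the option. */
int plain_ifdef_sees_option(void)
{
#ifdef CONFIG_DEMO_OPT
    return 1;
#else
    return 0;
#endif
}

/* The IS_ENABLED-style check accepts both =y and =m. */
int is_enabled_sees_option(void)
{
    return DEMO_IS_ENABLED(CONFIG_DEMO_OPT);
}
```

Here `plain_ifdef_sees_option()` returns 0 while `is_enabled_sees_option()` returns 1, which matches the failure mode described for a modular CONFIG_QFMT_V2.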

    Link: https://lore.kernel.org/r/20200221100835.9332-1-jack@suse.cz
    Fixes: d65d87a07476 ("ext4: improve explanation of a mount failure caused by a misconfigured kernel")
    Signed-off-by: Jan Kara
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 7c990728b99ed6fbe9c75fc202fce1172d9916da upstream.

    During an online resize an array of s_flex_groups structures gets replaced
    so it can get enlarged. If there is a concurrent access to the array and
    this memory has been reused then this can lead to an invalid memory access.

    The s_flex_groups array has been converted into an array of pointers
    rather than an array of structures. This ensures that the information
    contained in the structures cannot get out of sync during a resize due
    to an accessor updating the value in the old structure after it has
    been copied but before the array pointer is updated. Since the
    structures themselves are no longer copied, only the pointers to them,
    this case is mitigated.
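
A small userspace sketch of the pointer-array idea follows. The struct, its field, and `grow_groups()` are hypothetical simplifications, and the sketch deliberately leaves out the separate question of when the old array can safely be freed.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-in for a per-group structure. */
struct flex_group {
    long free_clusters;
};

/*
 * Grow an array of pointers from old_n to new_n entries. Only the
 * pointers are copied, so the old and new arrays refer to the same
 * flex_group objects. Returns NULL on failure or if new_n < old_n.
 */
struct flex_group **grow_groups(struct flex_group **old,
                                size_t old_n, size_t new_n)
{
    struct flex_group **new_arr;
    size_t i, j;

    if (new_n < old_n || new_n == 0)
        return NULL;
    new_arr = calloc(new_n, sizeof(*new_arr));
    if (!new_arr)
        return NULL;
    if (old_n)
        memcpy(new_arr, old, old_n * sizeof(*new_arr));
    for (i = old_n; i < new_n; i++) {
        new_arr[i] = calloc(1, sizeof(**new_arr));
        if (!new_arr[i]) {
            /* Free only what this call allocated, counting upwards. */
            for (j = old_n; j < i; j++)
                free(new_arr[j]);
            free(new_arr);
            return NULL;
        }
    }
    return new_arr;
}
```

If an accessor still holding the old array writes `old[1]->free_clusters` after the copy, the new array sees the same value through `new_arr[1]`, because both entries point at one object. With an array of structures, that late write would have landed in a stale copy.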

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
    Link: https://lore.kernel.org/r/20200221053458.730016-4-tytso@mit.edu
    Signed-off-by: Suraj Jitindar Singh
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Suraj Jitindar Singh
     
  • commit df3da4ea5a0fc5d115c90d5aa6caa4dd433750a7 upstream.

    During an online resize an array of pointers to s_group_info gets replaced
    so it can get enlarged. If there is a concurrent access to the array in
    ext4_get_group_info() and this memory has been reused then this can lead to
    an invalid memory access.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
    Link: https://lore.kernel.org/r/20200221053458.730016-3-tytso@mit.edu
    Signed-off-by: Suraj Jitindar Singh
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Balbir Singh
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Suraj Jitindar Singh
     
  • commit 1d0c3924a92e69bfa91163bda83c12a994b4d106 upstream.

    During an online resize an array of pointers to buffer heads gets
    replaced so it can get enlarged. If there is a racing block
    allocation or deallocation which uses the old array, and the old array
    has gotten reused this can lead to a GPF or some other random kernel
    memory getting modified.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206443
    Link: https://lore.kernel.org/r/20200221053458.730016-2-tytso@mit.edu
    Reported-by: Suraj Jitindar Singh
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit 9424ef56e13a1f14c57ea161eed3ecfdc7b2770e upstream.

    We observed a soft lockup problem in Linux 4.19 which can also be
    found in Linux 5.x.

    When a directory inode takes up a large number of blocks and the
    directory is growing while we are searching it, the restart branch
    can be taken many times, and the do-while loop can hold the CPU for
    a long time.

    Here is the call trace in Linux 4.19.

    [ 473.756186] Call trace:
    [ 473.756196] dump_backtrace+0x0/0x198
    [ 473.756199] show_stack+0x24/0x30
    [ 473.756205] dump_stack+0xa4/0xcc
    [ 473.756210] watchdog_timer_fn+0x300/0x3e8
    [ 473.756215] __hrtimer_run_queues+0x114/0x358
    [ 473.756217] hrtimer_interrupt+0x104/0x2d8
    [ 473.756222] arch_timer_handler_virt+0x38/0x58
    [ 473.756226] handle_percpu_devid_irq+0x90/0x248
    [ 473.756231] generic_handle_irq+0x34/0x50
    [ 473.756234] __handle_domain_irq+0x68/0xc0
    [ 473.756236] gic_handle_irq+0x6c/0x150
    [ 473.756238] el1_irq+0xb8/0x140
    [ 473.756286] ext4_es_lookup_extent+0xdc/0x258 [ext4]
    [ 473.756310] ext4_map_blocks+0x64/0x5c0 [ext4]
    [ 473.756333] ext4_getblk+0x6c/0x1d0 [ext4]
    [ 473.756356] ext4_bread_batch+0x7c/0x1f8 [ext4]
    [ 473.756379] ext4_find_entry+0x124/0x3f8 [ext4]
    [ 473.756402] ext4_lookup+0x8c/0x258 [ext4]
    [ 473.756407] __lookup_hash+0x8c/0xe8
    [ 473.756411] filename_create+0xa0/0x170
    [ 473.756413] do_mkdirat+0x6c/0x140
    [ 473.756415] __arm64_sys_mkdirat+0x28/0x38
    [ 473.756419] el0_svc_common+0x78/0x130
    [ 473.756421] el0_svc_handler+0x38/0x78
    [ 473.756423] el0_svc+0x8/0xc
    [ 485.755156] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [tmp:5149]

    Add cond_resched() to avoid the soft lockup and to improve system
    responsiveness.

    Link: https://lore.kernel.org/r/20200215080206.13293-1-luoshijie1@huawei.com
    Signed-off-by: Shijie Luo
    Signed-off-by: Theodore Ts'o
    Reviewed-by: Jan Kara
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Shijie Luo
     
  • commit 35df4299a6487f323b0aca120ea3f485dfee2ae3 upstream.

    EXT4_I(inode)->i_disksize could be accessed concurrently as noticed by
    KCSAN,

    BUG: KCSAN: data-race in ext4_write_end [ext4] / ext4_writepages [ext4]

    write to 0xffff91c6713b00f8 of 8 bytes by task 49268 on cpu 127:
    ext4_write_end+0x4e3/0x750 [ext4]
    ext4_update_i_disksize at fs/ext4/ext4.h:3032
    (inlined by) ext4_update_inode_size at fs/ext4/ext4.h:3046
    (inlined by) ext4_write_end at fs/ext4/inode.c:1287
    generic_perform_write+0x208/0x2a0
    ext4_buffered_write_iter+0x11f/0x210 [ext4]
    ext4_file_write_iter+0xce/0x9e0 [ext4]
    new_sync_write+0x29c/0x3b0
    __vfs_write+0x92/0xa0
    vfs_write+0x103/0x260
    ksys_write+0x9d/0x130
    __x64_sys_write+0x4c/0x60
    do_syscall_64+0x91/0xb47
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    read to 0xffff91c6713b00f8 of 8 bytes by task 24872 on cpu 37:
    ext4_writepages+0x10ac/0x1d00 [ext4]
    mpage_map_and_submit_extent at fs/ext4/inode.c:2468
    (inlined by) ext4_writepages at fs/ext4/inode.c:2772
    do_writepages+0x5e/0x130
    __writeback_single_inode+0xeb/0xb20
    writeback_sb_inodes+0x429/0x900
    __writeback_inodes_wb+0xc4/0x150
    wb_writeback+0x4bd/0x870
    wb_workfn+0x6b4/0x960
    process_one_work+0x54c/0xbe0
    worker_thread+0x80/0x650
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 37 PID: 24872 Comm: kworker/u261:2 Tainted: G W O L 5.5.0-next-20200204+ #5
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    Workqueue: writeback wb_workfn (flush-7:0)

    Since only the read is lockless (performed outside of "i_data_sem"),
    load tearing could introduce a logic bug. Fix it by adding
    READ_ONCE() for the read and WRITE_ONCE() for the write.
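
As a rough userspace illustration of the pattern, the sketch below defines simplified stand-ins for the two macros using a volatile cast. The macro bodies, the `demo_inode` struct, and the function names are assumptions made for this example; they rely on the GCC/Clang `__typeof__` extension and are not the kernel's definitions. A volatile access asks the compiler to perform exactly one access of the full object, but it provides no ordering or locking.

```c
/* Simplified stand-ins for the kernel macros (assumes GCC or Clang). */
#define READ_ONCE(x)       (*(const volatile __typeof__(x) *)&(x))
#define WRITE_ONCE(x, val) (*(volatile __typeof__(x) *)&(x) = (val))

/* Hypothetical, minimal stand-in for the in-memory inode. */
struct demo_inode {
    long long i_disksize;
};

/*
 * Writer side. In the commit this path runs with i_data_sem held, so
 * the plain read used for the comparison is not racing with another
 * writer; only the store needs to be marked.
 */
void demo_update_disksize(struct demo_inode *inode, long long newsize)
{
    if (newsize > inode->i_disksize)
        WRITE_ONCE(inode->i_disksize, newsize);
}

/* Lockless reader: a single marked load that the compiler may not split. */
long long demo_read_disksize(struct demo_inode *inode)
{
    return READ_ONCE(inode->i_disksize);
}
```

The reader takes one snapshot and works with that value, instead of letting the compiler re-read or split the load while a writer is updating the field.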

    Signed-off-by: Qian Cai
    Link: https://lore.kernel.org/r/1581085751-31793-1-git-send-email-cai@lca.pw
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Qian Cai
     

24 Feb, 2020

1 commit

  • [ Upstream commit 68e45330e341dad2d3a0a3f8ef2ec46a2a0a3bbc ]

    Without any form of coordination, any case where multiple allocations
    from the same mempool are needed at a time to make forward progress can
    deadlock under memory pressure.

    This is the case for struct bio_post_read_ctx, as one can be allocated
    to decrypt a Merkle tree page during fsverity_verify_bio(), which itself
    is running from a post-read callback for a data bio which has its own
    struct bio_post_read_ctx.

    Fix this by freeing the first bio_post_read_ctx before calling
    fsverity_verify_bio(). This works because verity (if enabled) is always
    the last post-read step.

    This deadlock can be reproduced by trying to read from an encrypted
    verity file after reducing NUM_PREALLOC_POST_READ_CTXS to 1 and patching
    mempool_alloc() to pretend that pool->alloc() always fails.

    Note that since NUM_PREALLOC_POST_READ_CTXS is actually 128, to actually
    hit this bug in practice would require reading from lots of encrypted
    verity files at the same time. But it's theoretically possible, as N
    available objects isn't enough to guarantee forward progress when > N/2
    threads each need 2 objects at a time.

    Fixes: 22cfe4b48ccb ("ext4: add fs-verity read support")
    Signed-off-by: Eric Biggers
    Link: https://lore.kernel.org/r/20191231181222.47684-1-ebiggers@kernel.org
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Eric Biggers