20 Jan, 2021

4 commits

  • commit 31e203e09f036f48e7c567c2d32df0196bbd303f upstream.

    After a full/fast commit, entries in the staging queue are promoted to
    the main queue. In the ext4_fc_cleanup function, however, the staging
    queue is spliced back onto the staging queue instead of the main queue.

    Fixes: aa75f4d3daaeb ("ext4: main fast-commit commit path")
    Signed-off-by: Daejun Park
    Reviewed-by: Harshad Shirwadkar
    Link: https://lore.kernel.org/r/20201230094851epcms2p6eeead8cc984379b37b2efd21af90fd1a@epcms2p6
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Daejun Park
     
  • commit 23dd561ad9eae02b4d51bb502fe4e1a0666e9567 upstream.

    1: ext4_iget/ext4_find_extent never return NULL, so use IS_ERR
    instead of IS_ERR_OR_NULL when checking their return values.

    2: ext4_fc_replay_inode should set the inode pointer to NULL when
    IS_ERR() is true, so that the subsequent iput() call is safe.
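
    As an illustration of both points, here is a minimal user-space sketch
    of the pattern (not the ext4 code): ERR_PTR()/IS_ERR() are simplified
    stand-ins for the kernel macros, and lookup_inode()/put_inode() are
    hypothetical helpers playing the roles of ext4_iget() and iput().

    #include <errno.h>
    #include <stdio.h>

    #define MAX_ERRNO 4095

    static inline void *ERR_PTR(long error) { return (void *)error; }
    static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
    static inline int IS_ERR(const void *ptr)
    {
        return (unsigned long)ptr >= (unsigned long)-MAX_ERRNO;
    }

    struct inode { int ino; };

    /* Never returns NULL: either a valid pointer or ERR_PTR(-errno). */
    static struct inode *lookup_inode(int ino)
    {
        static struct inode good = { 42 };

        return ino == 42 ? &good : ERR_PTR(-ENOENT);
    }

    /* Like iput(): tolerates NULL, but must never see an ERR_PTR value. */
    static void put_inode(struct inode *inode)
    {
        if (inode)
            printf("released inode %d\n", inode->ino);
    }

    static int replay(int ino)
    {
        struct inode *inode;
        int ret = 0;

        inode = lookup_inode(ino);
        if (IS_ERR(inode)) {        /* IS_ERR, not IS_ERR_OR_NULL */
            ret = PTR_ERR(inode);
            inode = NULL;           /* keep the cleanup path safe */
            goto out;
        }
        /* ... use the inode ... */
    out:
        put_inode(inode);
        return ret;
    }

    int main(void)
    {
        printf("replay(42) = %d\n", replay(42));
        printf("replay(7)  = %d\n", replay(7));
        return 0;
    }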

    Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
    Signed-off-by: Yi Li
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20201230033827.3996064-1-yili@winhong.com
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Yi Li
     
  • commit 29b665cc51e8b602bf2a275734349494776e3dbc upstream.

    Some extent io trees are initialized with a NULL private member (e.g.
    btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
    Dereferencing a NULL tree->private as an inode pointer will cause a panic.

    Pass tree->fs_info as it's known to be valid in all cases.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
    Fixes: 05912a3c04eb ("btrfs: drop extent_io_ops::tree_fs_info callback")
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Anand Jain
    Signed-off-by: Su Yue
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Su Yue
     
  • commit 50e31ef486afe60f128d42fb9620e2a63172c15c upstream.

    [BUG]
    There are several bug reports about recent kernel unable to relocate
    certain data block groups.

    Sometimes the error just goes away, but there is one reporter who can
    reproduce it reliably.

    The dmesg would look like:

    [438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
    [438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
    [450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
    [463.501781] BTRFS info (device dm-10): balance: ended with status: -2

    [CAUSE]
    The ENOENT error is returned from the following call chain:

    add_data_references()
    |- delete_v1_space_cache();
       |- if (!found)
              return -ENOENT;

    The variable @found is set to true if we find a data extent whose
    disk bytenr matches parameter @data_bytes.

    With extra debugging, the offending tree block looks like this:

    leaf bytenr = 42676709441536, data_bytenr = 34626327621632

    ctime 1567904822.739884119 (2019-09-08 03:07:02)
    mtime 0.0 (1970-01-01 01:00:00)
    otime 0.0 (1970-01-01 01:00:00)
    item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
    generation 1517381 type 2 (prealloc)
    prealloc data disk byte 34626327621632 nr 262144 <<<
    prealloc data offset 0 nr 262144
    item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
    generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
    lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
    uuid d0d4361f-d231-6d40-8901-fe506e4b2b53

    Although item 27 has disk bytenr 34626327621632, which matches the
    data_bytenr, its type is prealloc, not reg.
    This makes the existing code skip that item, and return ENOENT.

    [FIX]
    The check was changed by commit 19b546d7a1b2 ("btrfs: relocation: Use
    btrfs_find_all_leafs to locate data extent parent tree leaves"). Before
    that commit, we used something like

    "if (type == BTRFS_FILE_EXTENT_INLINE) continue;"

    but the offending commit checks (type == BTRFS_FILE_EXTENT_REG) instead,
    ignoring BTRFS_FILE_EXTENT_PREALLOC.

    Fix it by also checking BTRFS_FILE_EXTENT_PREALLOC.
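
    A hypothetical user-space sketch of the corrected check (the enum and
    helper only mirror the idea; they are not the btrfs code):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    enum file_extent_type { EXTENT_INLINE, EXTENT_REG, EXTENT_PREALLOC };

    struct file_extent_item {
        enum file_extent_type type;
        uint64_t disk_bytenr;
    };

    /* Before the fix only EXTENT_REG items were compared against the target
     * bytenr, so preallocated extents were silently skipped and the caller
     * returned ENOENT. */
    static bool matches(const struct file_extent_item *fi, uint64_t data_bytenr)
    {
        if (fi->type != EXTENT_REG && fi->type != EXTENT_PREALLOC)
            return false;    /* inline extents have no disk bytenr */
        return fi->disk_bytenr == data_bytenr;
    }

    int main(void)
    {
        struct file_extent_item prealloc = {
            .type = EXTENT_PREALLOC,
            .disk_bytenr = 34626327621632ULL,
        };

        printf("prealloc extent matches: %s\n",
               matches(&prealloc, 34626327621632ULL) ? "yes" : "no");
        return 0;
    }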

    Reported-by: Stéphane Lesimple
    Link: https://lore.kernel.org/linux-btrfs/505cabfa88575ed6dbe7cb922d8914fb@lesimple.fr
    Fixes: 19b546d7a1b2 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
    CC: stable@vger.kernel.org # 5.6+
    Tested-By: Stéphane Lesimple
    Reviewed-by: Su Yue
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

17 Jan, 2021

9 commits

  • commit 4f8b848788f77c7f5c3bd98febce66b7aa14785f upstream.

    When CRC32 is disabled, zonefs cannot be linked:

    ld: fs/zonefs/super.o: in function `zonefs_fill_super':

    Add a Kconfig 'select' statement for it.

    Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Damien Le Moal
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     
  • commit 2ca408d9c749c32288bc28725f9f12ba30299e8f upstream.

    Commit

    121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")

    converted native x86-32 syscalls that take 64-bit arguments to use the
    compat handlers, to allow conversion to passing args via pt_regs.
    sys_fanotify_mark() was however missed, as it has a general compat
    handler. Add a config option that will use the syscall wrapper that
    takes the split args for native 32-bit.

    [ bp: Fix typo in Kconfig help text. ]

    Fixes: 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")
    Reported-by: Paweł Jasiak
    Signed-off-by: Brian Gerst
    Signed-off-by: Borislav Petkov
    Acked-by: Jan Kara
    Acked-by: Andy Lutomirski
    Link: https://lkml.kernel.org/r/20201130223059.101286-1-brgerst@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Brian Gerst
     
  • [ Upstream commit e076ab2a2ca70a0270232067cd49f76cd92efe64 ]

    Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
    shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
    some infrastructure we have in place to flush inodes that we use for
    device replace and snapshot. However this introduced a pretty serious
    performance regression. To reproduce it, the user untarred the source
    tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
    see it take anywhere from 5 to 20 times as long to untar in 5.10
    compared to 5.9. This was observed on fast devices (SSD and better) and
    not on HDD.

    The root cause is because before we would generally use the normal
    writeback path to reclaim delalloc space, and for this we would provide
    it with the number of pages we wanted to flush. The referenced commit
    changed this to flush that many inodes, which drastically increased the
    amount of space we were flushing in certain cases, which severely
    affected performance.

    We cannot revert this patch unfortunately because of 3d45f221ce62
    ("btrfs: fix deadlock when cloning inline extent and low on free
    metadata space") which requires the ability to skip flushing inodes that
    are being cloned in certain scenarios, which means we need to keep using
    our flushing infrastructure or risk re-introducing the deadlock.

    Instead to fix this problem we can go back to providing
    btrfs_start_delalloc_roots with a number of pages to flush, and then set
    up a writeback_control and utilize sync_inode() to handle the flushing
    for us. This gives us the same behavior we had prior to the fix, while
    still allowing us to avoid the deadlock that was fixed by Filipe. I
    redid the user's original test and got the following results on one of
    our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)

    5.9 0m54.258s
    5.10 1m26.212s
    5.10+patch 0m38.800s

    5.10+patch is significantly faster than plain 5.9 because of my patch
    series "Change data reservations to use the ticketing infra" which
    contained the patch that introduced the regression, but generally
    improved the overall ENOSPC flushing mechanisms.

    Additional testing on a consumer-grade SSD (8GiB ram, 8 CPU) confirms
    the results:

    5.10.5 4m00s
    5.10.5+patch 1m08s
    5.11-rc2 5m14s
    5.11-rc2+patch 1m30s

    Reported-by: René Rebe
    Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
    CC: stable@vger.kernel.org # 5.10
    Signed-off-by: Josef Bacik
    Tested-by: David Sterba
    Reviewed-by: David Sterba
    [ add my test results ]
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 3d45f221ce627d13e2e6ef3274f06750c84a6542 ]

    When cloning an inline extent there are cases where we can not just copy
    the inline extent from the source range to the target range (e.g. when the
    target range starts at an offset greater than zero). In such cases we copy
    the inline extent's data into a page of the destination inode and then
    dirty that page. However, after that we will need to start a transaction
    for each processed extent and, if we are ever low on available metadata
    space, we may need to flush existing delalloc for all dirty inodes in an
    attempt to release metadata space - if that happens we may deadlock:

    * the async reclaim task queued a delalloc work to flush delalloc for
    the destination inode of the clone operation;

    * the task executing that delalloc work gets blocked waiting for the
    range with the dirty page to be unlocked, which is currently locked
    by the task doing the clone operation;

    * the async reclaim task blocks waiting for the delalloc work to complete;

    * the cloning task is waiting on the waitqueue of its reservation ticket
    while holding the range with the dirty page locked in the inode's
    io_tree;

    * if metadata space is not released by some other task (like delalloc for
    some other inode completing for example), the clone task waits forever
    and as a consequence the delalloc work and async reclaim tasks will hang
    forever as well. Releasing more space on the other hand may require
    starting a transaction, which will hang as well when trying to reserve
    metadata space, resulting in a deadlock between all these tasks.

    When this happens, traces like the following show up in dmesg/syslog:

    [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
    [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
    [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
    [87452.326136] Call Trace:
    [87452.326737] __schedule+0x5d1/0xcf0
    [87452.327390] schedule+0x45/0xe0
    [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
    [87452.328894] ? finish_wait+0x90/0x90
    [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
    [87452.330133] ? __mod_memcg_state+0x8e/0x160
    [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
    [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
    [87452.332007] ? lock_release+0x20e/0x4c0
    [87452.332557] ? trace_hardirqs_on+0x1b/0xf0
    [87452.333127] extent_writepages+0x43/0x90 [btrfs]
    [87452.333653] ? lock_acquire+0x1a3/0x490
    [87452.334177] do_writepages+0x43/0xe0
    [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
    [87452.335720] __filemap_fdatawrite_range+0xc5/0x100
    [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
    [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
    [87452.337838] process_one_work+0x24e/0x5e0
    [87452.338437] worker_thread+0x50/0x3b0
    [87452.339137] ? process_one_work+0x5e0/0x5e0
    [87452.339884] kthread+0x153/0x170
    [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
    [87452.341153] ret_from_fork+0x22/0x30
    [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
    [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
    [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
    [87452.345655] Call Trace:
    [87452.346305] __schedule+0x5d1/0xcf0
    [87452.346947] ? kvm_clock_read+0x14/0x30
    [87452.347676] ? wait_for_completion+0x81/0x110
    [87452.348389] schedule+0x45/0xe0
    [87452.349077] schedule_timeout+0x30c/0x580
    [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [87452.350340] ? lock_acquire+0x1a3/0x490
    [87452.351006] ? try_to_wake_up+0x7a/0xa20
    [87452.351541] ? lock_release+0x20e/0x4c0
    [87452.352040] ? lock_acquired+0x199/0x490
    [87452.352517] ? wait_for_completion+0x81/0x110
    [87452.353000] wait_for_completion+0xab/0x110
    [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
    [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
    [87452.354455] flush_space+0x24f/0x660 [btrfs]
    [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
    [87452.355565] process_one_work+0x24e/0x5e0
    [87452.356024] worker_thread+0x20f/0x3b0
    [87452.356487] ? process_one_work+0x5e0/0x5e0
    [87452.356973] kthread+0x153/0x170
    [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
    [87452.357880] ret_from_fork+0x22/0x30
    (...)
    < stack traces of several tasks waiting for the locks of the inodes of the
    clone operation >
    (...)
    [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
    [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
    [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
    [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
    [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
    [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
    [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
    [92867.447920] Call Trace:
    [92867.448435] __schedule+0x5d1/0xcf0
    [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [92867.449423] schedule+0x45/0xe0
    [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
    [92867.450576] ? finish_wait+0x90/0x90
    [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
    [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
    [92867.452412] start_transaction+0x2d1/0x760 [btrfs]
    [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
    [92867.453848] ? lock_release+0x20e/0x4c0
    [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
    [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
    [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
    [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
    [92867.457213] do_clone_file_range+0xd4/0x1f0
    [92867.457828] vfs_clone_file_range+0x4d/0x230
    [92867.458355] ? lock_release+0x20e/0x4c0
    [92867.458890] ioctl_file_clone+0x8f/0xc0
    [92867.459377] do_vfs_ioctl+0x342/0x750
    [92867.459913] __x64_sys_ioctl+0x62/0xb0
    [92867.460377] do_syscall_64+0x33/0x80
    [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    (...)
    < stack traces of more tasks blocked on metadata reservation like the clone
    task above, because the async reclaim task has deadlocked >
    (...)

    Another thing to notice is that the worker task that is deadlocked when
    trying to flush the destination inode of the clone operation is at
    btrfs_invalidatepage(). This is simply because the clone operation has a
    destination offset greater than the i_size and we only update the i_size
    of the destination file after cloning an extent (just like we do in the
    buffered write path).

    Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
    the flushing of delalloc for all inodes that have delalloc, add a runtime
    flag to an inode to signal it should not be flushed, and for inodes with
    that flag set, start_delalloc_inodes() will simply skip them. When the
    cloning code needs to dirty a page to copy an inline extent, set that flag
    on the inode and then clear it when the clone operation finishes.

    This could be sporadically triggered with test case generic/269 from
    fstests, which exercises many fsstress processes running in parallel with
    several dd processes filling up the entire filesystem.

    CC: stable@vger.kernel.org # 5.9+
    Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit f2f121ab500d0457cc9c6f54269d21ffdf5bd304 ]

    Every time we log an inode we look up in the fs/subvol tree for xattrs
    and, if we have any, log them into the log tree. However it is very
    common to have inodes without any xattrs, so the search wastes time and,
    more importantly, adds contention on the fs/subvol tree locks, either
    making the logging code block and wait for tree locks or making the
    logging code make other concurrent operations block and wait.

    The most typical use cases where xattrs are used are when capabilities or
    ACLs are defined for an inode, or when SELinux is enabled.

    This change makes the logging code detect when an inode does not have
    xattrs and skip the xattrs search the next time the inode is logged,
    unless the inode is evicted and loaded again or a xattr is added to the
    inode. This skips the search for xattrs on inodes that never have
    xattrs and are fsynced with some frequency.
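
    A user-space sketch of the idea, with hypothetical names (not the btrfs
    structures or functions): a per-inode flag remembers that the last
    search found no xattrs, and adding a xattr clears it again.

    #include <stdbool.h>
    #include <stdio.h>

    struct mem_inode {
        bool no_xattrs;    /* set once a log-time lookup finds none */
    };

    /* Stand-in for the expensive, lock-heavy fs/subvol tree search. */
    static int count_xattrs_on_disk(const struct mem_inode *inode)
    {
        (void)inode;
        return 0;
    }

    static void log_inode_xattrs(struct mem_inode *inode)
    {
        if (inode->no_xattrs) {
            printf("skipping xattr search\n");
            return;
        }
        if (count_xattrs_on_disk(inode) == 0) {
            inode->no_xattrs = true;    /* remember for the next fsync */
            printf("no xattrs found, flag set\n");
            return;
        }
        printf("copying xattrs into the log\n");
    }

    static void add_xattr(struct mem_inode *inode)
    {
        inode->no_xattrs = false;    /* must search again next time */
    }

    int main(void)
    {
        struct mem_inode inode = { .no_xattrs = false };

        log_inode_xattrs(&inode);    /* searches, finds none, sets the flag */
        log_inode_xattrs(&inode);    /* skips the search */
        add_xattr(&inode);           /* invalidates the hint */
        log_inode_xattrs(&inode);    /* searches again */
        return 0;
    }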

    The following script that calls dbench was used to measure the impact of
    this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
    directly (no intermediary filesystem on the host) and using a non-debug
    kernel (default configuration on Debian distributions):

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/sdk
    MNT=/mnt/sdk
    MOUNT_OPTIONS="-o ssd"

    mkfs.btrfs -f -m single -d single $DEV
    mount $MOUNT_OPTIONS $DEV $MNT

    dbench -D $MNT -t 200 40

    umount $MNT

    The results before this change:

    Operation Count AvgLat MaxLat
    ----------------------------------------
    NTCreateX 5761605 0.172 312.057
    Close 4232452 0.002 10.927
    Rename 243937 1.406 277.344
    Unlink 1163456 0.631 298.402
    Deltree 160 11.581 221.107
    Mkdir 80 0.003 0.005
    Qpathinfo 5221410 0.065 122.309
    Qfileinfo 915432 0.001 3.333
    Qfsinfo 957555 0.003 3.992
    Sfileinfo 469244 0.023 20.494
    Find 2018865 0.448 123.659
    WriteX 2874851 0.049 118.529
    ReadX 9030579 0.004 21.654
    LockX 18754 0.003 4.423
    UnlockX 18754 0.002 0.331
    Flush 403792 10.944 359.494

    Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms

    The results after this change:

    Operation Count AvgLat MaxLat
    ----------------------------------------
    NTCreateX 6442521 0.159 230.693
    Close 4732357 0.002 10.972
    Rename 272809 1.293 227.398
    Unlink 1301059 0.563 218.500
    Deltree 160 7.796 54.887
    Mkdir 80 0.008 0.478
    Qpathinfo 5839452 0.047 124.330
    Qfileinfo 1023199 0.001 4.996
    Qfsinfo 1070760 0.003 5.709
    Sfileinfo 524790 0.033 21.765
    Find 2257658 0.314 125.611
    WriteX 3211520 0.040 232.135
    ReadX 10098969 0.004 25.340
    LockX 20974 0.003 1.569
    UnlockX 20974 0.002 3.475
    Flush 451553 10.287 331.037

    Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms

    +10.8% throughput, -8.2% max latency

    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit 3e2224c5867fead6c0b94b84727cc676ac6353a3 ]

    alloc_fixed_file_ref_node() currently returns an ERR_PTR on failure.
    io_sqe_files_unregister() expects it to return NULL and since it can only
    return -ENOMEM, it makes more sense to change alloc_fixed_file_ref_node()
    to behave that way.

    Fixes: 1ffc54220c44 ("io_uring: fix io_sqe_files_unregister() hangs")
    Reported-by: Dan Carpenter
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Matthew Wilcox (Oracle)
     
  • commit 6c503150ae33ee19036255cfda0998463613352c upstream

    IOPOLL skips completion locking but keeps it under uring_lock, thus
    io_cqring_overflow_flush() and so io_cqring_events() need additional
    locking with uring_lock in some cases for IOPOLL.

    Remove __io_cqring_overflow_flush() from io_cqring_events(), introduce a
    wrapper around flush doing needed synchronisation and call it by hand.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • commit 89448c47b8452b67c146dc6cad6f737e004c5caf upstream

    We don't need to take uring_lock for SQPOLL|IOPOLL to do
    io_cqring_overflow_flush() when cq_overflow_list is empty, remove it
    from the hot path.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • commit 81b6d05ccad4f3d8a9dfb091fb46ad6978ee40e4 upstream

    io_req_task_submit() might be called for IOPOLL, do the fail path under
    uring_lock to comply with IOPOLL synchronisation based solely on it.

    Cc: stable@vger.kernel.org # 5.5+
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     

13 Jan, 2021

2 commits

  • commit 0b3f407e6728d990ae1630a02c7b952c21c288d3 upstream.

    When doing an incremental send, if we have a new inode that happens to
    have the same number that an old directory inode had in the base snapshot
    and that old directory has a pending rmdir operation, we end up computing
    a wrong path for the new inode, causing the receiver to fail.

    Example reproducer:

    $ cat test-send-rmdir.sh
    #!/bin/bash

    DEV=/dev/sdi
    MNT=/mnt/sdi

    mkfs.btrfs -f $DEV >/dev/null
    mount $DEV $MNT

    mkdir $MNT/dir
    touch $MNT/dir/file1
    touch $MNT/dir/file2
    touch $MNT/dir/file3

    # Filesystem looks like:
    #
    # . (ino 256)
    # |----- dir/ (ino 257)
    # |----- file1 (ino 258)
    # |----- file2 (ino 259)
    # |----- file3 (ino 260)
    #

    btrfs subvolume snapshot -r $MNT $MNT/snap1
    btrfs send -f /tmp/snap1.send $MNT/snap1

    # Now remove our directory and all its files.
    rm -fr $MNT/dir

    # Unmount the filesystem and mount it again. This is to ensure that
    # the next inode that is created ends up with the same inode number
    # that our directory "dir" had, 257, which is the first free "objectid"
    # available after mounting again the filesystem.
    umount $MNT
    mount $DEV $MNT

    # Now create a new file (it could be a directory as well).
    touch $MNT/newfile

    # Filesystem now looks like:
    #
    # . (ino 256)
    # |----- newfile (ino 257)
    #

    btrfs subvolume snapshot -r $MNT $MNT/snap2
    btrfs send -f /tmp/snap2.send -p $MNT/snap1 $MNT/snap2

    # Now unmount the filesystem, create a new one, mount it and try to apply
    # both send streams to recreate both snapshots.
    umount $DEV

    mkfs.btrfs -f $DEV >/dev/null

    mount $DEV $MNT

    btrfs receive -f /tmp/snap1.send $MNT
    btrfs receive -f /tmp/snap2.send $MNT

    umount $MNT

    When running the test, the receive operation for the incremental stream
    fails:

    $ ./test-send-rmdir.sh
    Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap1'
    At subvol /mnt/sdi/snap1
    Create a readonly snapshot of '/mnt/sdi' in '/mnt/sdi/snap2'
    At subvol /mnt/sdi/snap2
    At subvol snap1
    At snapshot snap2
    ERROR: chown o257-9-0 failed: No such file or directory

    So fix this by tracking directories that have a pending rmdir by inode
    number and generation number, instead of only inode number.

    A test case for fstests follows soon.
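
    The key change can be sketched in a few lines of hypothetical user-space
    C (illustration only, not the send code): pending rmdir entries are keyed
    by (inode number, generation), so a new inode that merely reuses the old
    directory's number no longer matches.

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Hypothetical key for a directory with a pending rmdir. */
    struct orphan_dir_key {
        uint64_t ino;
        uint64_t gen;
    };

    static bool same_dir(const struct orphan_dir_key *a,
                         const struct orphan_dir_key *b)
    {
        /* Before the fix only ->ino was compared, so a new inode reusing
         * the number of the deleted directory matched by mistake. */
        return a->ino == b->ino && a->gen == b->gen;
    }

    int main(void)
    {
        struct orphan_dir_key pending_rmdir = { .ino = 257, .gen = 7 };
        struct orphan_dir_key new_inode     = { .ino = 257, .gen = 9 };

        printf("new inode treated as dir with pending rmdir: %s\n",
               same_dir(&pending_rmdir, &new_inode) ? "yes" : "no");
        return 0;
    }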

    Reported-by: Massimo B.
    Tested-by: Massimo B.
    Link: https://lore.kernel.org/linux-btrfs/6ae34776e85912960a253a8327068a892998e685.camel@gmx.net/
    CC: stable@vger.kernel.org # 4.19+
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit ae5e070eaca9dbebde3459dd8f4c2756f8c097d0 upstream.

    There is a chance of racing for qgroup flushing which may lead to
    deadlock:

    Thread A                          |  Thread B
    (not holding trans handle)        |  (holding a trans handle)
    ----------------------------------+--------------------------------
    __btrfs_qgroup_reserve_meta()     |  __btrfs_qgroup_reserve_meta()
    |- try_flush_qgroup()             |  |- try_flush_qgroup()
       |- QGROUP_FLUSHING bit set     |     |
       |                              |     |- test_and_set_bit()
       |                              |     |- wait_event()
       |- btrfs_join_transaction()    |
       |- btrfs_commit_transaction()  |

    !!! DEAD LOCK !!!

    Thread A wants to commit the transaction, but thread B is holding a
    transaction handle, blocking the commit. At the same time, thread B is
    waiting for thread A to finish its commit.

    This is just a hot fix, and would lead to more EDQUOT when we're near
    the qgroup limit.

    The proper fix would be to make all metadata/data reservations happen
    without holding a transaction handle.

    CC: stable@vger.kernel.org # 5.9+
    Reviewed-by: Filipe Manana
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

09 Jan, 2021

2 commits

  • [ Upstream commit f7cfd871ae0c5008d94b6f66834e7845caa93c15 ]

    Recently syzbot reported[0] that there is a deadlock amongst the users
    of exec_update_mutex. The problematic lock ordering found by lockdep
    was:

    perf_event_open  (exec_update_mutex -> ovl_i_mutex)
    chown            (ovl_i_mutex       -> sb_writes)
    sendfile         (sb_writes         -> p->lock)
                       by reading from a proc file and writing to overlayfs
    proc_pid_syscall (p->lock           -> exec_update_mutex)

    While looking at possible solutions it occurred to me that all of the
    users and possible users involved only wanted the state of the given
    process to remain the same. They are all readers. The only writer is
    exec.

    There is no reason for readers to block on each other. So fix
    this deadlock by transforming exec_update_mutex into a rw_semaphore
    named exec_update_lock that only exec takes for writing.
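
    As a rough user-space analogy of the change (a pthread rwlock standing in
    for the kernel rw_semaphore; the readers and writer below are only
    illustrative), readers of a task's state can run concurrently while exec
    still gets exclusive access:

    #include <pthread.h>
    #include <stdio.h>

    /* Analogue of exec_update_lock: readers such as perf_event_open() or
     * /proc readers only need the task state to stay stable, so they share
     * the lock; only "exec" takes it for writing. */
    static pthread_rwlock_t exec_update_lock = PTHREAD_RWLOCK_INITIALIZER;

    static void *reader(void *arg)
    {
        pthread_rwlock_rdlock(&exec_update_lock);
        printf("reader %ld: task state is stable\n", (long)arg);
        pthread_rwlock_unlock(&exec_update_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[2];

        for (long i = 0; i < 2; i++)
            pthread_create(&t[i], NULL, reader, (void *)i);
        for (int i = 0; i < 2; i++)
            pthread_join(t[i], NULL);

        /* "exec": the only writer, excludes all readers while it runs. */
        pthread_rwlock_wrlock(&exec_update_lock);
        printf("exec: updating creds/mm exclusively\n");
        pthread_rwlock_unlock(&exec_update_lock);
        return 0;
    }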

    Cc: Jann Horn
    Cc: Vasiliy Kulikov
    Cc: Al Viro
    Cc: Bernd Edlinger
    Cc: Oleg Nesterov
    Cc: Christopher Yeoh
    Cc: Cyrill Gorcunov
    Cc: Sargun Dhillon
    Cc: Christian Brauner
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
    [0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
    Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
    Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     
  • [ Upstream commit 5d069dbe8aaf2a197142558b6fb2978189ba3454 ]

    Jan Kara's analysis of the syzbot report (edited):

    The reproducer opens a directory on FUSE filesystem, it then attaches
    dnotify mark to the open directory. After that a fuse_do_getattr() call
    finds that attributes returned by the server are inconsistent, and calls
    make_bad_inode() which, among other things does:

    inode->i_mode = S_IFREG;

    This then confuses dnotify which doesn't tear down its structures
    properly and eventually crashes.

    Avoid calling make_bad_inode() on a live inode: switch to a private flag on
    the fuse inode. Also add the test to ops which the bad_inode_ops would
    have caught.

    This bug goes back to the initial merge of fuse in 2.6.14...

    Reported-by: syzbot+f427adf9324b92652ccc@syzkaller.appspotmail.com
    Signed-off-by: Miklos Szeredi
    Tested-by: Jan Kara
    Cc:
    Signed-off-by: Sasha Levin

    Miklos Szeredi
     

06 Jan, 2021

22 commits

  • [ Upstream commit 82ef1370b0c1757ab4ce29f34c52b4e93839b0aa ]

    Commit cfd732377221 ("ext4: add prefetching for block allocation
    bitmaps") introduced block bitmap prefetch, and expects to read block
    bitmaps of flex_bg through an IO. However, it seems to ignore the
    value range of s_log_groups_per_flex. In the scenario where the value
    of s_log_groups_per_flex is greater than 27, s_mb_prefetch or
    s_mb_prefetch_limit will overflow, causing a divide-by-zero exception.

    In addition, the logic of calculating nr is also flawed, because the
    size of flexbg is fixed during a single mount, but s_mb_prefetch can
    be modified, which causes nr to fail to meet the value condition of
    [1, flexbg_size].

    To solve this problem, we need to set an upper limit on s_mb_prefetch.
    Since we expect to load the block bitmaps of a flex_bg through one IO,
    a reasonable upper limit can be taken from the IO limit parameters.
    After consideration, we chose BLK_MAX_SEGMENT_SIZE. This is a good
    choice to solve the divide-by-zero problem while avoiding performance
    degradation.

    [ Some minor code simplifications to make the changes easy to follow -- TYT ]
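
    The arithmetic hazard itself is easy to reproduce in user space; the
    sketch below uses stand-in names, not the ext4 code, and the clamp value
    is purely illustrative:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned int log_groups_per_flex = 32;  /* from a crafted superblock */
        uint32_t groups_per_flex;
        uint32_t nr_to_read = 1000;

        /* A shift by 32 or more is undefined for a 32-bit value, and the
         * stored result ends up as 0, which is later used as a divisor.
         * Guard and clamp the untrusted exponent before shifting. */
        if (log_groups_per_flex >= 31) {
            printf("clamping bogus s_log_groups_per_flex=%u\n",
                   log_groups_per_flex);
            log_groups_per_flex = 31;
        }
        groups_per_flex = 1u << log_groups_per_flex;

        /* Without the clamp this division could be the divide-by-zero. */
        printf("prefetch chunks = %u\n", nr_to_read / groups_per_flex);
        return 0;
    }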

    Reported-by: Tosk Robot
    Signed-off-by: Chunguang Xu
    Reviewed-by: Samuel Liao
    Reviewed-by: Andreas Dilger
    Link: https://lore.kernel.org/r/1607051143-24508-1-git-send-email-brookxu@tencent.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Chunguang Xu
     
  • [ Upstream commit 9cd2be519d05ee78876d55e8e902b7125f78b74f ]

    list_empty_careful() is not racy only if some conditions are met, i.e.
    no re-adds after del_init. io_cqring_overflow_flush() does list_move(),
    so it's actually racy.

    Remove those checks, we have ->cq_check_overflow for the fast path.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit 68cbb8056a4c24c6a38ad2b79e0a9764b235e8fa ]

    Signed-off-by: Jeff Layton
    Reviewed-by: Ilya Dryomov
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Sasha Levin

    Jeff Layton
     
  • [ Upstream commit 503b934a752f7e789a5f33217520e0a79f3096ac ]

    Expanding the READ_PLUS extents can cause the read buffer to overflow.
    If it does, then don't error, but just exit early.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit edf7ddbf1c5eb98b720b063b73e20e8a4a1ce673 ]

    Missing calls to mntget() (or equivalently, too many calls to mntput())
    are hard to detect because mntput() delays freeing mounts using
    task_work_add(), then again using call_rcu(). As a result, mnt_count
    can often be decremented to -1 without getting a KASAN use-after-free
    report. Such cases are still bugs though, and they point to real
    use-after-frees being possible.

    For an example of this, see the bug fixed by commit 1b0b9cc8d379
    ("vfs: fsmount: add missing mntget()"), discussed at
    https://lkml.kernel.org/linux-fsdevel/20190605135401.GB30925@xxxxxxxxxxxxxxxxxxxxxxxxx/T/#u.
    This bug *should* have been trivial to find. But actually, it wasn't
    found until syzkaller happened to use fchdir() to manipulate the
    reference count just right for the bug to be noticeable.

    Address this by making mntput_no_expire() issue a WARN if mnt_count has
    become negative.
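
    The detection idea generalizes to any reference counter; a hypothetical
    user-space sketch (not the mntput_no_expire() code):

    #include <stdio.h>

    struct mount_like {
        long count;    /* stand-in for mnt_count */
    };

    static void get_ref(struct mount_like *m)
    {
        m->count++;
    }

    static void put_ref(struct mount_like *m)
    {
        m->count--;
        /* Equivalent of the added WARN: an extra put (or a missing get) is
         * reported immediately instead of silently leaving a potential
         * use-after-free behind. */
        if (m->count < 0)
            fprintf(stderr, "WARNING: mount ref count went negative (%ld)\n",
                    m->count);
    }

    int main(void)
    {
        struct mount_like m = { .count = 1 };

        get_ref(&m);    /* 2 */
        put_ref(&m);    /* 1 */
        put_ref(&m);    /* 0: balanced */
        put_ref(&m);    /* -1: unbalanced put, triggers the warning */
        return 0;
    }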

    Suggested-by: Miklos Szeredi
    Signed-off-by: Eric Biggers
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Eric Biggers
     
  • [ Upstream commit 6422a71ef40e4751d59b8c9412e7e2dafe085878 ]

    I found out f2fs_free_dic() is invoked at the wrong time, while
    f2fs_verify_bio() still needs the dic info, and it triggered the
    below kernel panic. It is caused by a race on the pending_pages value
    between the decompression and verity logic, when the same compression
    cluster has been split across different bios. With split bios,
    f2fs_verify_bio() ends up decreasing the pending_pages value before it
    is reset to nr_cpages by f2fs_decompress_pages(), which caused the
    kernel panic.

    [ 4416.564763] Unable to handle kernel NULL pointer dereference
    at virtual address 0000000000000000
    ...
    [ 4416.896016] Workqueue: fsverity_read_queue f2fs_verity_work
    [ 4416.908515] pc : fsverity_verify_page+0x20/0x78
    [ 4416.913721] lr : f2fs_verify_bio+0x11c/0x29c
    [ 4416.913722] sp : ffffffc019533cd0
    [ 4416.913723] x29: ffffffc019533cd0 x28: 0000000000000402
    [ 4416.913724] x27: 0000000000000001 x26: 0000000000000100
    [ 4416.913726] x25: 0000000000000001 x24: 0000000000000004
    [ 4416.913727] x23: 0000000000001000 x22: 0000000000000000
    [ 4416.913728] x21: 0000000000000000 x20: ffffffff2076f9c0
    [ 4416.913729] x19: ffffffff2076f9c0 x18: ffffff8a32380c30
    [ 4416.913731] x17: ffffffc01f966d97 x16: 0000000000000298
    [ 4416.913732] x15: 0000000000000000 x14: 0000000000000000
    [ 4416.913733] x13: f074faec89ffffff x12: 0000000000000000
    [ 4416.913734] x11: 0000000000001000 x10: 0000000000001000
    [ 4416.929176] x9 : ffffffff20d1f5c7 x8 : 0000000000000000
    [ 4416.929178] x7 : 626d7464ff286b6b x6 : ffffffc019533ade
    [ 4416.929179] x5 : 000000008049000e x4 : ffffffff2793e9e0
    [ 4416.929180] x3 : 000000008049000e x2 : ffffff89ecfa74d0
    [ 4416.929181] x1 : 0000000000000c40 x0 : ffffffff2076f9c0
    [ 4416.929184] Call trace:
    [ 4416.929187] fsverity_verify_page+0x20/0x78
    [ 4416.929189] f2fs_verify_bio+0x11c/0x29c
    [ 4416.929192] f2fs_verity_work+0x58/0x84
    [ 4417.050667] process_one_work+0x270/0x47c
    [ 4417.055354] worker_thread+0x27c/0x4d8
    [ 4417.059784] kthread+0x13c/0x320
    [ 4417.063693] ret_from_fork+0x10/0x18

    Chao pointed this can happen by the below race condition.

    Thread A f2fs_post_read_wq fsverity_wq
    - f2fs_read_multi_pages()
    - f2fs_alloc_dic
    - dic->pending_pages = 2
    - submit_bio()
    - submit_bio()
    - f2fs_post_read_work() handle first bio
    - f2fs_decompress_work()
    - __read_end_io()
    - f2fs_decompress_pages()
    - dic->pending_pages--
    - enqueue f2fs_verity_work()
    - f2fs_verity_work() handle first bio
    - f2fs_verify_bio()
    - dic->pending_pages--
    - f2fs_post_read_work() handle second bio
    - f2fs_decompress_work()
    - enqueue f2fs_verity_work()
    - f2fs_verify_pages()
    - f2fs_free_dic()

    - f2fs_verity_work() handle second bio
    - f2fs_verfy_bio()
    - use-after-free on dic

    Signed-off-by: Daeho Jeong
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin

    Daeho Jeong
     
  • [ Upstream commit a95ba66ac1457b76fe472c8e092ab1006271f16c ]

    Light reported that sometimes the shrinker sees nat_cnt < dirty_nat_cnt,
    resulting in wrong shrinker work. Let's avoid returning an insanely
    overflowed value by adding a single tracking value.

    Reported-by: Light Hsieh
    Reviewed-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Sasha Levin

    Jaegeuk Kim
     
  • [ Upstream commit b6d49ecd1081740b6e632366428b960461f8158b ]

    When returning the layout in nfs4_evict_inode(), we need to ensure that
    the layout is actually done being freed before we can proceed to free the
    inode itself.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit 10f04d40a9fa29785206c619f80d8beedb778837 ]

    The on-disk quota format supports quota files with up to 2^32 blocks. Be
    careful when computing quota file offsets in the quota files from block
    numbers as they can overflow 32-bit types. Since quota files larger than
    4GB would require ~26 million quota users, this is mostly a theoretical
    concern now, but better be careful; fuzzers would find the problem
    sooner or later anyway...
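
    The overflow is plain integer arithmetic; a minimal sketch (illustrative
    names and block size, not the quota code):

    #include <stdint.h>
    #include <stdio.h>

    #define QT_BLKSIZE_BITS 10    /* illustrative 1 KiB quota blocks */

    int main(void)
    {
        uint32_t blk = 5000000;    /* block number within the quota file */

        /* Buggy: the shift is evaluated in 32 bits and wraps before the
         * widening assignment. */
        long long bad  = blk << QT_BLKSIZE_BITS;

        /* Fixed: widen to a 64-bit type first, then shift. */
        long long good = (long long)blk << QT_BLKSIZE_BITS;

        printf("bad offset  = %lld\n", bad);
        printf("good offset = %lld\n", good);
        return 0;
    }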

    Reviewed-by: Andreas Dilger
    Signed-off-by: Jan Kara
    Signed-off-by: Sasha Levin

    Jan Kara
     
  • commit 65b2b213484acd89a3c20dbb524e52a2f3793b78 upstream.

    syzbot reports following issue:
    INFO: task syz-executor.2:12399 can't die for more than 143 seconds.
    task:syz-executor.2 state:D stack:28744 pid:12399 ppid: 8504 flags:0x00004004
    Call Trace:
    context_switch kernel/sched/core.c:3773 [inline]
    __schedule+0x893/0x2170 kernel/sched/core.c:4522
    schedule+0xcf/0x270 kernel/sched/core.c:4600
    schedule_timeout+0x1d8/0x250 kernel/time/timer.c:1847
    do_wait_for_common kernel/sched/completion.c:85 [inline]
    __wait_for_common kernel/sched/completion.c:106 [inline]
    wait_for_common kernel/sched/completion.c:117 [inline]
    wait_for_completion+0x163/0x260 kernel/sched/completion.c:138
    kthread_stop+0x17a/0x720 kernel/kthread.c:596
    io_put_sq_data fs/io_uring.c:7193 [inline]
    io_sq_thread_stop+0x452/0x570 fs/io_uring.c:7290
    io_finish_async fs/io_uring.c:7297 [inline]
    io_sq_offload_create fs/io_uring.c:8015 [inline]
    io_uring_create fs/io_uring.c:9433 [inline]
    io_uring_setup+0x19b7/0x3730 fs/io_uring.c:9507
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x45deb9
    Code: Unable to access opcode bytes at RIP 0x45de8f.
    RSP: 002b:00007f174e51ac78 EFLAGS: 00000246 ORIG_RAX: 00000000000001a9
    RAX: ffffffffffffffda RBX: 0000000000008640 RCX: 000000000045deb9
    RDX: 0000000000000000 RSI: 0000000020000140 RDI: 00000000000050e5
    RBP: 000000000118bf58 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118bf2c
    R13: 00007ffed9ca723f R14: 00007f174e51b9c0 R15: 000000000118bf2c
    INFO: task syz-executor.2:12399 blocked for more than 143 seconds.
    Not tainted 5.10.0-rc3-next-20201110-syzkaller #0
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

    We don't have a reproducer yet, but there seems to be a race in the
    current code:
    => io_put_sq_data                  |
       ctx_list is empty now.          |
    ==> kthread_park(sqd->thread);     |
                                       | T1: sq thread is parked now.
    ==> kthread_stop(sqd->thread);     |
        KTHREAD_SHOULD_STOP is set now.|
    ===> kthread_unpark(k);            |
                                       | T2: sq thread is now unparked, runs again.
                                       |
                                       | T3: sq thread is now preempted out.
                                       |
    ===> wake_up_process(k);           |
                                       |
                                       | T4: Since the sqd ctx_list is empty, needs_sched
                                       |     will be true, so the sq thread sets its task
                                       |     state to TASK_INTERRUPTIBLE and schedules;
                                       |     the sq thread will never be woken up.
    ===> wait_for_completion           |

    I have artificially used mdelay() to simulate the above race and got the
    same stack as in this syzbot report, but to be honest, I'm not sure this
    race is what triggers the syzbot report.

    To fix this possible race, check whether the sq thread has been stopped
    when it is unparked.

    Reported-by: syzbot+03beeb595f074db9cfd1@syzkaller.appspotmail.com
    Signed-off-by: Xiaoguang Wang
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Xiaoguang Wang
     
  • commit 8d1ddb5e79374fb277985a6b3faa2ed8631c5b4c upstream.

    Syzbot reports a potential deadlock found by the newly added recursive
    read deadlock detection in lockdep:

    [...] ========================================================
    [...] WARNING: possible irq lock inversion dependency detected
    [...] 5.9.0-rc2-syzkaller #0 Not tainted
    [...] --------------------------------------------------------
    [...] syz-executor.1/10214 just changed the state of lock:
    [...] ffff88811f506338 (&f->f_owner.lock){.+..}-{2:2}, at: send_sigurg+0x1d/0x200
    [...] but this lock was taken by another, HARDIRQ-safe lock in the past:
    [...] (&dev->event_lock){-...}-{2:2}
    [...]
    [...]
    [...] and interrupts could create inverse lock ordering between them.
    [...]
    [...]
    [...] other info that might help us debug this:
    [...] Chain exists of:
    [...] &dev->event_lock --> &new->fa_lock --> &f->f_owner.lock
    [...]
    [...] Possible interrupt unsafe locking scenario:
    [...]
    [...] CPU0 CPU1
    [...] ---- ----
    [...] lock(&f->f_owner.lock);
    [...] local_irq_disable();
    [...] lock(&dev->event_lock);
    [...] lock(&new->fa_lock);
    [...]
    [...] lock(&dev->event_lock);
    [...]
    [...] *** DEADLOCK ***

    The corresponding deadlock case is as followed:

    CPU 0                        CPU 1                        CPU 2
    read_lock(&fown->lock);
                                 spin_lock_irqsave(&dev->event_lock, ...)
                                                              write_lock_irq(&filp->f_owner.lock);
                                                              // wait for the lock
                                 read_lock(&fown->lock);
                                 // have to wait until the writer releases,
                                 // due to the fairness

    spin_lock_irqsave(&dev->event_lock); // wait for the lock

    The lock dependency on CPU 1 happens if there exists a call sequence:

    input_inject_event():
      spin_lock_irqsave(&dev->event_lock, ...);
      input_handle_event():
        input_pass_values():
          input_to_handler():
            handler->event(): // evdev_event()
              evdev_pass_values():
                spin_lock(&client->buffer_lock);
                __pass_event():
                  kill_fasync():
                    kill_fasync_rcu():
                      read_lock(&fa->fa_lock);
                      send_sigio():
                        read_lock(&fown->lock);

    To fix this, make the readers in send_sigurg() and send_sigio() use
    read_lock_irqsave() and read_unlock_irqrestore().

    Reported-by: syzbot+22e87cdf94021b984aa6@syzkaller.appspotmail.com
    Reported-by: syzbot+c5e32344981ad9f33750@syzkaller.appspotmail.com
    Signed-off-by: Boqun Feng
    Signed-off-by: Jeff Layton
    Signed-off-by: Greg Kroah-Hartman

    Boqun Feng
     
  • commit c9200760da8a728eb9767ca41a956764b28c1310 upstream.

    Check for valid block size directly by validating s_log_block_size; we
    were doing this in two places. First, by calculating blocksize via
    BLOCK_SIZE << s_log_block_size, and then checking that the blocksize
    was valid. And then secondly, by checking s_log_block_size directly.

    The first check is not reliable, and can trigger an UBSAN warning if
    s_log_block_size on a maliciously corrupted superblock is greater than
    22. This is harmless, since the second test will correctly reject the
    maliciously fuzzed file system, but to make syzbot shut up, and
    because the two checks are duplicative in any case, delete the
    blocksize check, and move the s_log_block_size check earlier in
    ext4_fill_super().
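
    A minimal sketch of that ordering in user-space C (the limits mirror
    ext4's constants, but this is an illustration, not the ext4_fill_super()
    code): validate the exponent first, and only then shift.

    #include <stdio.h>

    #define MIN_BLOCK_LOG_SIZE 10    /* 1 KiB  */
    #define MAX_BLOCK_LOG_SIZE 16    /* 64 KiB */
    #define BASE_BLOCK_SIZE    1024

    static int blocksize_from_sb(unsigned int s_log_block_size)
    {
        if (s_log_block_size > MAX_BLOCK_LOG_SIZE - MIN_BLOCK_LOG_SIZE) {
            fprintf(stderr, "invalid log block size: %u\n", s_log_block_size);
            return -1;
        }
        /* The shift is now guaranteed to stay within the int range. */
        return BASE_BLOCK_SIZE << s_log_block_size;
    }

    int main(void)
    {
        printf("blocksize(2)  = %d\n", blocksize_from_sb(2));   /* 4096 */
        printf("blocksize(40) = %d\n", blocksize_from_sb(40));  /* rejected */
        return 0;
    }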

    Signed-off-by: Theodore Ts'o
    Reported-by: syzbot+345b75652b1d24227443@syzkaller.appspotmail.com
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit dc889b8d4a8122549feabe99eead04e6b23b6513 upstream.

    Make the printk() [bfs "printf" macro] seem less severe by changing
    "WARNING:" to "NOTE:".

    syzbot warns us about using WARNING or BUG in a format string
    other than in WARN() or BUG() family macros. bfs/inode.c is doing just
    that in a normal printk() call, so change the "WARNING" string to be
    "NOTE".

    Link: https://lkml.kernel.org/r/20201203212634.17278-1-rdunlap@infradead.org
    Reported-by: syzbot+3fd34060f26e766536ff@syzkaller.appspotmail.com
    Signed-off-by: Randy Dunlap
    Cc: Dmitry Vyukov
    Cc: Al Viro
    Cc: "Tigran A. Aivazian"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Randy Dunlap
     
  • commit e584bbe821229a3e7cc409eecd51df66f9268c21 upstream.

    syzbot reported a bug which could cause a shift-out-of-bounds issue;
    fix it.

    Call Trace:
    __dump_stack lib/dump_stack.c:79 [inline]
    dump_stack+0x107/0x163 lib/dump_stack.c:120
    ubsan_epilogue+0xb/0x5a lib/ubsan.c:148
    __ubsan_handle_shift_out_of_bounds.cold+0xb1/0x181 lib/ubsan.c:395
    sanity_check_raw_super fs/f2fs/super.c:2812 [inline]
    read_raw_super_block fs/f2fs/super.c:3267 [inline]
    f2fs_fill_super.cold+0x16c9/0x16f6 fs/f2fs/super.c:3519
    mount_bdev+0x34d/0x410 fs/super.c:1366
    legacy_get_tree+0x105/0x220 fs/fs_context.c:592
    vfs_get_tree+0x89/0x2f0 fs/super.c:1496
    do_new_mount fs/namespace.c:2896 [inline]
    path_mount+0x12ae/0x1e70 fs/namespace.c:3227
    do_mount fs/namespace.c:3240 [inline]
    __do_sys_mount fs/namespace.c:3448 [inline]
    __se_sys_mount fs/namespace.c:3425 [inline]
    __x64_sys_mount+0x27f/0x300 fs/namespace.c:3425
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported-by: syzbot+ca9a785f8ac472085994@syzkaller.appspotmail.com
    Signed-off-by: Anant Thazhemadam
    Signed-off-by: Chao Yu
    Signed-off-by: Jaegeuk Kim
    Signed-off-by: Greg Kroah-Hartman

    Chao Yu
     
  • commit d24396c5290ba8ab04ba505176874c4e04a2d53c upstream.

    When a directory item has an invalid value set for ih_entry_count, it
    might trigger a use-after-free or an out-of-bounds read in
    bin_search_in_dir_item().

    ih_entry_count * IH_SIZE for a directory item should not be larger than
    ih_item_len.

    Link: https://lore.kernel.org/r/20201101140958.3650143-1-rkovhaev@gmail.com
    Reported-and-tested-by: syzbot+83b6f7cf9922cae5c4d7@syzkaller.appspotmail.com
    Link: https://syzkaller.appspot.com/bug?extid=83b6f7cf9922cae5c4d7
    Signed-off-by: Rustam Kovhaev
    Signed-off-by: Jan Kara
    Signed-off-by: Greg Kroah-Hartman

    Rustam Kovhaev
     
  • commit 1ffc54220c444774b7f09e6d2121e732f8e19b94 upstream.

    io_sqe_files_unregister() uninterruptibly waits for enqueued ref nodes,
    however requests keeping them may never complete, e.g. because of some
    userspace dependency. Make sure it's interruptible otherwise it would
    hang forever.

    Cc: stable@vger.kernel.org # 5.6+
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Pavel Begunkov
     
  • commit 1642b4450d20e31439c80c28256c8eee08684698 upstream.

    Setting a new reference node to a file data is not trivial, don't repeat
    it, add and use a helper.

    Cc: stable@vger.kernel.org # 5.6+
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Pavel Begunkov
     
  • commit ac0648a56c1ff66c1cbf735075ad33a26cbc50de upstream.

    io_file_data_ref_zero() can be invoked from soft-irq from the RCU core,
    hence we need to ensure that the file_data lock is bottom half safe. Use
    the _bh() variants when grabbing this lock.

    Reported-by: syzbot+1f4ba1e5520762c523c6@syzkaller.appspotmail.com
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • commit 77788775c7132a8d93c6930ab1bd84fc743c7cb7 upstream.

    If we COW the identity, we assume that ->mm never changes. But this
    isn't true if multiple processes end up sharing the ring. Hence treat
    id->mm like any other process component when it comes to the identity
    mapping. This is pretty trivial, just moving the existing grab into
    io_grab_identity(), and including a check for the match.

    Cc: stable@vger.kernel.org # 5.10
    Fixes: 1e6fa5216a0e ("io_uring: COW io_identity on mismatch")
    Reported-by: Christian Brauner
    Tested-by: Christian Brauner
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Jens Axboe
     
  • [ Upstream commit a61df3c413e49b0042f9caf774c58512d1cc71b7 ]

    syzkaller found the following JFFS2 splat:

    Unable to handle kernel paging request at virtual address dfffa00000000001
    Mem abort info:
    ESR = 0x96000004
    EC = 0x25: DABT (current EL), IL = 32 bits
    SET = 0, FnV = 0
    EA = 0, S1PTW = 0
    Data abort info:
    ISV = 0, ISS = 0x00000004
    CM = 0, WnR = 0
    [dfffa00000000001] address between user and kernel address ranges
    Internal error: Oops: 96000004 [#1] SMP
    Dumping ftrace buffer:
    (ftrace buffer empty)
    Modules linked in:
    CPU: 0 PID: 12745 Comm: syz-executor.5 Tainted: G S 5.9.0-rc8+ #98
    Hardware name: linux,dummy-virt (DT)
    pstate: 20400005 (nzCv daif +PAN -UAO BTYPE=--)
    pc : jffs2_parse_param+0x138/0x308 fs/jffs2/super.c:206
    lr : jffs2_parse_param+0x108/0x308 fs/jffs2/super.c:205
    sp : ffff000022a57910
    x29: ffff000022a57910 x28: 0000000000000000
    x27: ffff000057634008 x26: 000000000000d800
    x25: 000000000000d800 x24: ffff0000271a9000
    x23: ffffa0001adb5dc0 x22: ffff000023fdcf00
    x21: 1fffe0000454af2c x20: ffff000024cc9400
    x19: 0000000000000000 x18: 0000000000000000
    x17: 0000000000000000 x16: ffffa000102dbdd0
    x15: 0000000000000000 x14: ffffa000109e44bc
    x13: ffffa00010a3a26c x12: ffff80000476e0b3
    x11: 1fffe0000476e0b2 x10: ffff80000476e0b2
    x9 : ffffa00010a3ad60 x8 : ffff000023b70593
    x7 : 0000000000000003 x6 : 00000000f1f1f1f1
    x5 : ffff000023fdcf00 x4 : 0000000000000002
    x3 : ffffa00010000000 x2 : 0000000000000001
    x1 : dfffa00000000000 x0 : 0000000000000008
    Call trace:
    jffs2_parse_param+0x138/0x308 fs/jffs2/super.c:206
    vfs_parse_fs_param+0x234/0x4e8 fs/fs_context.c:117
    vfs_parse_fs_string+0xe8/0x148 fs/fs_context.c:161
    generic_parse_monolithic+0x17c/0x208 fs/fs_context.c:201
    parse_monolithic_mount_data+0x7c/0xa8 fs/fs_context.c:649
    do_new_mount fs/namespace.c:2871 [inline]
    path_mount+0x548/0x1da8 fs/namespace.c:3192
    do_mount+0x124/0x138 fs/namespace.c:3205
    __do_sys_mount fs/namespace.c:3413 [inline]
    __se_sys_mount fs/namespace.c:3390 [inline]
    __arm64_sys_mount+0x164/0x238 fs/namespace.c:3390
    __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
    el0_svc_common.constprop.0+0x15c/0x598 arch/arm64/kernel/syscall.c:149
    do_el0_svc+0x60/0x150 arch/arm64/kernel/syscall.c:195
    el0_svc+0x34/0xb0 arch/arm64/kernel/entry-common.c:226
    el0_sync_handler+0xc8/0x5b4 arch/arm64/kernel/entry-common.c:236
    el0_sync+0x15c/0x180 arch/arm64/kernel/entry.S:663
    Code: d2d40001 f2fbffe1 91002260 d343fc02 (38e16841)
    ---[ end trace 4edf690313deda44 ]---

    This is because since ec10a24f10c8, the option parsing happens before
    fill_super and so the MTD device isn't associated with the filesystem.
    Defer the size check until there is a valid association.

    Fixes: ec10a24f10c8 ("vfs: Convert jffs2 to use the new mount API")
    Cc:
    Cc: David Howells
    Signed-off-by: Jamie Iles
    Signed-off-by: Richard Weinberger
    Signed-off-by: Sasha Levin

    Jamie Iles
     
  • [ Upstream commit cd3ed3c73ac671ff6b0230ccb72b8300292d3643 ]

    Setting rp_size to zero will be ignored during remounting.

    The way we identify whether an rp_size option was passed on remount is
    to check whether the rp_size input is zero. That cannot work if we
    pass "rp_size=0".

    This patch adds a bool variable "set_rp_size" to fix this problem.
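
    The parsing pattern is easy to sketch in user space (hypothetical names,
    not the jffs2 code): remember that the option was seen, independently of
    its value.

    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct remount_opts {
        unsigned long rp_size;
        bool set_rp_size;    /* was an rp_size= option given at all? */
    };

    static void parse_one(struct remount_opts *o, const char *opt)
    {
        if (strncmp(opt, "rp_size=", 8) == 0) {
            o->rp_size = strtoul(opt + 8, NULL, 10);
            o->set_rp_size = true;    /* remember even when the value is 0 */
        }
    }

    int main(void)
    {
        struct remount_opts remount = { 0 };

        parse_one(&remount, "rp_size=0");

        /* With only the value to go on, "rp_size=0" and "no rp_size option"
         * are indistinguishable; the bool removes the ambiguity. */
        if (remount.set_rp_size)
            printf("apply new rp_size = %lu\n", remount.rp_size);
        else
            printf("keep the existing rp_size\n");
        return 0;
    }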

    Reported-by: Jubin Zhong
    Signed-off-by: lizhe
    Signed-off-by: Richard Weinberger
    Signed-off-by: Sasha Levin

    lizhe
     
  • commit dfea9fce29fda6f2f91161677e0e0d9b671bc099 upstream.

    The purpose of io_uring_cancel_files() is to wait for all requests
    matching ->files to go/be cancelled. We should first drop files of a
    request in io_req_drop_files() and only then make it undiscoverable for
    io_uring_cancel_files.

    First drop, then delete from the list. It's ok to leave req->id->files
    dangling, because it's not dereferenced by the cancellation code, only
    compared against. io_uring_cancel_files() would potentially go to sleep
    and then be woken by the wake_up() that follows in io_req_drop_files().

    Fixes: 0f2122045b946 ("io_uring: don't rely on weak ->files references")
    Cc: # 5.5+
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Greg Kroah-Hartman

    Pavel Begunkov
     

30 Dec, 2020

1 commit

  • commit 398840f8bb935d33c64df4ec4fed77a7d24c267d upstream.

    This was an oversight in the original implementation, as it makes no
    sense to specify both scoping flags to the same openat2(2) invocation
    (before this patch, the result of such an invocation was equivalent to
    RESOLVE_IN_ROOT being ignored).

    This is a userspace-visible ABI change, but the only user of openat2(2)
    at the moment is LXC which doesn't specify both flags and so no
    userspace programs will break as a result.
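
    The rejected combination can be sketched as a small user-space check
    (check_resolve_flags() is a hypothetical helper, not the kernel code; it
    needs the uapi header from v5.6+ for the RESOLVE_* constants):

    #include <errno.h>
    #include <stdio.h>
    #include <linux/openat2.h>    /* RESOLVE_BENEATH, RESOLVE_IN_ROOT */

    static int check_resolve_flags(unsigned long long resolve)
    {
        /* The two scoping flags are mutually exclusive: asking for both in
         * one openat2(2) call is now rejected with -EINVAL. */
        if ((resolve & (RESOLVE_BENEATH | RESOLVE_IN_ROOT)) ==
            (RESOLVE_BENEATH | RESOLVE_IN_ROOT))
            return -EINVAL;
        return 0;
    }

    int main(void)
    {
        printf("BENEATH only    -> %d\n", check_resolve_flags(RESOLVE_BENEATH));
        printf("BENEATH|IN_ROOT -> %d\n",
               check_resolve_flags(RESOLVE_BENEATH | RESOLVE_IN_ROOT));
        return 0;
    }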

    Fixes: fddb5d430ad9 ("open: introduce openat2(2) syscall")
    Signed-off-by: Aleksa Sarai
    Acked-by: Christian Brauner
    Cc: # v5.6+
    Link: https://lore.kernel.org/r/20201027235044.5240-2-cyphar@cyphar.com
    Signed-off-by: Christian Brauner
    Signed-off-by: Greg Kroah-Hartman

    Aleksa Sarai