09 Nov, 2022

2 commits

  • commit 21a87d88c2253350e115029f14fe2a10a7e6c856 upstream.

    If the i_mode field in the inode of a metadata file is corrupted on
    disk, the initialization of the bmap structure, which should have been
    done from nilfs_read_inode_common(), can be skipped. This causes a
    lockdep warning followed by a NULL pointer dereference at
    nilfs_bmap_lookup_at_level().

    This patch fixes these issues by adding a missing sanity check for the
    i_mode field of the metadata file's inode.

    Link: https://lkml.kernel.org/r/20221002030804.29978-1-konishi.ryusuke@gmail.com
    Signed-off-by: Ryusuke Konishi
    Reported-by: syzbot+2b32eb36c1a825b7a74c@syzkaller.appspotmail.com
    Reported-by: Tetsuo Handa
    Tested-by: Ryusuke Konishi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit 1e512c65b4adcdbdf7aead052f2162b079cc7f55)

    Ryusuke Konishi
     
  • commit 61a1d87a324ad5e3ed27c6699dfc93218fcf3201 upstream.

    The check in __ext4_read_dirblock() for a block being outside of the
    directory size was wrong because it compared the block number against
    the directory size in bytes. Fix it.
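    The unit mismatch can be illustrated with a small sketch (not the real __ext4_read_dirblock(); names are made up): a directory of `dir_bytes` bytes holds `dir_bytes >> blkbits` blocks, so a logical block number must be compared against the size in blocks, not bytes, or the bound almost never triggers.

```c
#include <assert.h>

/* Corrected form of the bound: compare the block number against the
 * directory size expressed in blocks. The buggy form compared it
 * against dir_bytes directly, a value thousands of times larger. */
int block_in_dir(unsigned long block, unsigned long dir_bytes,
		 unsigned int blkbits)
{
	return block < (dir_bytes >> blkbits);
}
```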

    Fixes: 65f8ea4cd57d ("ext4: check if directory block is within i_size")
    CVE: CVE-2022-1184
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Reviewed-by: Lukas Czerner
    Link: https://lore.kernel.org/r/20220822114832.1482-1-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit dd366295d1eca557e7a9000407ec3952f691d27b)

    Jan Kara
     

28 Sep, 2022

13 commits

  • commit a9f2a2931d0e197ab28c6007966053fdababd53f upstream.

    Currently we don't use any preallocation when a file is already closed
    when allocating blocks (from the writeback code when converting delayed
    allocation). However, for small files, using locality group
    preallocation is actually desirable, as it is not specific to a
    particular file; rather, it is a method of packing small files together
    to reduce fragmentation, and the fact that the file is closed is an
    even stronger hint that the file would benefit from packing. So change
    the logic to allow locality group preallocation in this case.
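    The decision can be sketched as follows (names and the size cutoff are hypothetical, not ext4's actual code or tunables):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SMALL_REQ_BYTES (64 * 1024)	/* hypothetical cutoff */

/* Small requests go to the shared locality-group preallocation so that
 * small files get packed together; under the changed logic, a closed
 * file no longer opts out of this. */
bool use_locality_group(size_t req_bytes, bool file_closed)
{
	if (req_bytes > SMALL_REQ_BYTES)
		return false;	/* large request: per-inode preallocation */
	(void)file_closed;	/* if anything, "closed" is a stronger hint
				   in favour of packing, so it no longer
				   disables locality-group preallocation */
	return true;
}
```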

    Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
    CC: stable@kernel.org
    Reported-and-tested-by: Stefan Wahren
    Tested-by: Ojaswin Mujoo
    Reviewed-by: Ritesh Harjani (IBM)
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
    Link: https://lore.kernel.org/r/20220908092136.11770-4-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 1940265ede6683f6317cba0d428ce6505eaca944 upstream.

    mb_set_largest_free_order() updates the lists containing groups with
    the largest chunk of free space of a given order. The way it updates
    them always moves the group to the tail of the list. Thus allocations
    looking for free space of a given order effectively end up cycling
    through all groups (and, due to initialization, in last-to-first
    order). This spreads allocations among block groups, which reduces
    performance for rotating disks or low-end flash media. Change
    mb_set_largest_free_order() to only update the lists if the order of
    the largest free chunk in the group changed.
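    A minimal sketch of the fix (simplified; not the real mb_set_largest_free_order(), and the counter stands in for the list manipulation):

```c
#include <assert.h>

long requeues;	/* stands in for list_del() + list_add_tail() */

/* Only re-queue the group on the per-order list when its largest free
 * order actually changed. Re-queueing on every call moved the group to
 * the list tail, making allocators cycle through all groups. */
void set_largest_free_order(int *cur_order, int new_order)
{
	if (*cur_order == new_order)
		return;		/* unchanged: leave list position alone */
	*cur_order = new_order;
	requeues++;
}
```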

    Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
    CC: stable@kernel.org
    Reported-and-tested-by: Stefan Wahren
    Tested-by: Ojaswin Mujoo
    Reviewed-by: Ritesh Harjani (IBM)
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
    Link: https://lore.kernel.org/r/20220908092136.11770-2-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 4fca50d440cc5d4dc570ad5484cc0b70b381bc2a upstream.

    One of the side-effects of mb_optimize_scan was that the optimized
    functions to select the next group to try were called even before we
    tried the goal group. As a result we no longer allocate files close to
    their corresponding inodes, nor do we try to expand the currently
    allocated extent in the same group. This results in a reaim regression
    with the workfile.disk workload of up to 8% with many clients on my
    test machine:

    baseline mb_optimize_scan
    Hmean disk-1 2114.16 ( 0.00%) 2099.37 ( -0.70%)
    Hmean disk-41 87794.43 ( 0.00%) 83787.47 * -4.56%*
    Hmean disk-81 148170.73 ( 0.00%) 135527.05 * -8.53%*
    Hmean disk-121 177506.11 ( 0.00%) 166284.93 * -6.32%*
    Hmean disk-161 220951.51 ( 0.00%) 207563.39 * -6.06%*
    Hmean disk-201 208722.74 ( 0.00%) 203235.59 ( -2.63%)
    Hmean disk-241 222051.60 ( 0.00%) 217705.51 ( -1.96%)
    Hmean disk-281 252244.17 ( 0.00%) 241132.72 * -4.41%*
    Hmean disk-321 255844.84 ( 0.00%) 245412.84 * -4.08%*

    This is also causing a huge regression (time increased by a factor of
    5 or so) when untarring an archive with lots of small files on some
    eMMC storage cards.

    Fix the problem by making sure we try goal group first.

    Fixes: 196e402adf2e ("ext4: improve cr 0 / cr 1 group scanning")
    CC: stable@kernel.org
    Reported-and-tested-by: Stefan Wahren
    Tested-by: Ojaswin Mujoo
    Reviewed-by: Ritesh Harjani (IBM)
    Link: https://lore.kernel.org/all/20220727105123.ckwrhbilzrxqpt24@quack3/
    Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20220908092136.11770-1-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 80fa46d6b9e7b1527bfd2197d75431fd9c382161 upstream.

    This patch avoids threads live-locking for hours when a large number
    of threads are competing over the last few free extents as blocks get
    added to and removed from preallocation pools. From our bug
    reporter:

    A reliable way for triggering this has multiple writers
    continuously write() to files when the filesystem is full, while
    small amounts of space are freed (e.g. by truncating a large file
    -1MiB at a time). In the local filesystem, this can be done by
    simply not checking the return code of write (0) and/or the error
    (ENOSPC) that is set. Over NFS with an async mount, even clients
    with proper error checking will behave this way, since the Linux NFS
    client implementation will not propagate the server errors [the
    write syscalls immediately return success] until the file handle is
    closed. This leads to a situation where NFS clients send a
    continuous stream of WRITE rpcs which result in ENOSPC -- but
    since the client isn't seeing this, the stream of writes continues
    at maximum network speed.

    When some space does appear, multiple writers will all attempt to
    claim it for their current write. For NFS, we may see dozens to
    hundreds of threads that do this.

    The real-world scenario of this is database backup tooling (in
    particular, github.com/mdkent/percona-xtrabackup) which may write
    large files (>1TiB) to NFS for safe keeping. Some temporary files
    are written, rewound, and read back -- all before closing the file
    handle (the temp file is actually unlinked, to trigger automatic
    deletion on close/crash.) An application like this operating on an
    async NFS mount will not see an error code until TiB have been
    written/read.

    The lockup was observed when running this database backup on large
    filesystems (64 TiB in this case) with a high number of block
    groups and no free space. Fragmentation is generally not a factor
    in this filesystem (~thousands of large files, mostly contiguous
    except for the parts written while the filesystem is at capacity.)

    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Theodore Ts'o
     
  • commit 29a5b8a137ac8eb410cc823653a29ac0e7b7e1b0 upstream.

    When walking through an inode's extents, the ext4_ext_binsearch_idx()
    function assumes that the extent header has been previously validated.
    However, there are no checks verifying that the number of entries
    (eh->eh_entries) is non-zero when the depth is > 0. This leads to
    problems because EXT_FIRST_INDEX() and EXT_LAST_INDEX() will return
    garbage, resulting in this:

    [ 135.245946] ------------[ cut here ]------------
    [ 135.247579] kernel BUG at fs/ext4/extents.c:2258!
    [ 135.249045] invalid opcode: 0000 [#1] PREEMPT SMP
    [ 135.250320] CPU: 2 PID: 238 Comm: tmp118 Not tainted 5.19.0-rc8+ #4
    [ 135.252067] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b-rebuilt.opensuse.org 04/01/2014
    [ 135.255065] RIP: 0010:ext4_ext_map_blocks+0xc20/0xcb0
    [ 135.256475] Code:
    [ 135.261433] RSP: 0018:ffffc900005939f8 EFLAGS: 00010246
    [ 135.262847] RAX: 0000000000000024 RBX: ffffc90000593b70 RCX: 0000000000000023
    [ 135.264765] RDX: ffff8880038e5f10 RSI: 0000000000000003 RDI: ffff8880046e922c
    [ 135.266670] RBP: ffff8880046e9348 R08: 0000000000000001 R09: ffff888002ca580c
    [ 135.268576] R10: 0000000000002602 R11: 0000000000000000 R12: 0000000000000024
    [ 135.270477] R13: 0000000000000000 R14: 0000000000000024 R15: 0000000000000000
    [ 135.272394] FS: 00007fdabdc56740(0000) GS:ffff88807dd00000(0000) knlGS:0000000000000000
    [ 135.274510] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 135.276075] CR2: 00007ffc26bd4f00 CR3: 0000000006261004 CR4: 0000000000170ea0
    [ 135.277952] Call Trace:
    [ 135.278635]
    [ 135.279247] ? preempt_count_add+0x6d/0xa0
    [ 135.280358] ? percpu_counter_add_batch+0x55/0xb0
    [ 135.281612] ? _raw_read_unlock+0x18/0x30
    [ 135.282704] ext4_map_blocks+0x294/0x5a0
    [ 135.283745] ? xa_load+0x6f/0xa0
    [ 135.284562] ext4_mpage_readpages+0x3d6/0x770
    [ 135.285646] read_pages+0x67/0x1d0
    [ 135.286492] ? folio_add_lru+0x51/0x80
    [ 135.287441] page_cache_ra_unbounded+0x124/0x170
    [ 135.288510] filemap_get_pages+0x23d/0x5a0
    [ 135.289457] ? path_openat+0xa72/0xdd0
    [ 135.290332] filemap_read+0xbf/0x300
    [ 135.291158] ? _raw_spin_lock_irqsave+0x17/0x40
    [ 135.292192] new_sync_read+0x103/0x170
    [ 135.293014] vfs_read+0x15d/0x180
    [ 135.293745] ksys_read+0xa1/0xe0
    [ 135.294461] do_syscall_64+0x3c/0x80
    [ 135.295284] entry_SYSCALL_64_after_hwframe+0x46/0xb0

    This patch simply adds an extra check in __ext4_ext_check(), verifying that
    eh_entries is not 0 when eh_depth is > 0.
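    A simplified sketch of the added sanity check (illustrative only, not the actual __ext4_ext_check() or its on-disk structures):

```c
#include <assert.h>
#include <stdint.h>

/* Cut-down stand-in for the extent header fields involved. */
struct eh_sketch {
	uint16_t eh_entries;
	uint16_t eh_depth;
};

/* An interior extent-tree node (eh_depth > 0) must carry at least one
 * index entry; otherwise EXT_FIRST_INDEX()/EXT_LAST_INDEX() would hand
 * back garbage pointers. Reject such headers before any binsearch. */
int ext_check_sketch(const struct eh_sketch *eh)
{
	if (eh->eh_depth > 0 && eh->eh_entries == 0)
		return -1;	/* corrupt header */
	return 0;
}
```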

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=215941
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=216283
    Cc: Baokun Li
    Cc: stable@kernel.org
    Signed-off-by: Luís Henriques
    Reviewed-by: Jan Kara
    Reviewed-by: Baokun Li
    Link: https://lore.kernel.org/r/20220822094235.2690-1-lhenriques@suse.de
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Luís Henriques
     
  • commit 613c5a85898d1cd44e68f28d65eccf64a8ace9cf upstream.

    Currently the Orlov inode allocator searches for free inodes for a
    directory only in flex block groups with at most inodes_per_group/16
    more directory inodes than average per flex block group. However, with
    the growing size of flex block groups this becomes unnecessarily
    strict. Scale the allowed difference from the average directory count
    per flex block group with the flex block group size, as we do with
    other metrics.

    Tested-by: Stefan Wahren
    Tested-by: Ojaswin Mujoo
    Cc: stable@kernel.org
    Link: https://lore.kernel.org/all/0d81a7c2-46b7-6010-62a4-3e6cfc1628d6@i2se.com/
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20220908092136.11770-3-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     
  • commit 6e176d47160cec8bcaa28d9aa06926d72d54237c upstream.

    We mustn't call nfs_wb_all() on anything other than a regular file.
    Furthermore, we can exit early when we don't hold a delegation.

    Reported-by: David Wysochanski
    Signed-off-by: Trond Myklebust
    Cc: Thorsten Leemhuis
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • [ Upstream commit 17d9c15c9b9e7fb285f7ac5367dfb5f00ff575e3 ]

    I got an infinite loop and a WARNING report when executing a tail command
    in virtiofs.

    WARNING: CPU: 10 PID: 964 at fs/iomap/iter.c:34 iomap_iter+0x3a2/0x3d0
    Modules linked in:
    CPU: 10 PID: 964 Comm: tail Not tainted 5.19.0-rc7
    Call Trace:

    dax_iomap_rw+0xea/0x620
    ? __this_cpu_preempt_check+0x13/0x20
    fuse_dax_read_iter+0x47/0x80
    fuse_file_read_iter+0xae/0xd0
    new_sync_read+0xfe/0x180
    ? 0xffffffff81000000
    vfs_read+0x14d/0x1a0
    ksys_read+0x6d/0xf0
    __x64_sys_read+0x1a/0x20
    do_syscall_64+0x3b/0x90
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    The tail command will call read() with a count of 0. In this case,
    iomap_iter() reports this WARNING and always returns 1, causing an
    infinite loop in dax_iomap_rw().

    Fix this by checking whether count is 0 in dax_iomap_rw().
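    The shape of the fix, as a userspace sketch (not the real dax_iomap_rw(); the function name is made up):

```c
#include <assert.h>
#include <stddef.h>

/* A zero-byte request must return 0 up front; handing count == 0 to
 * the iteration machinery made iomap_iter() warn and return 1 forever,
 * so the caller never left its loop. */
long dax_rw_sketch(size_t count)
{
	if (!count)
		return 0;	/* the added early return */
	/* ... iterate the iomap and copy data ... */
	return (long)count;
}
```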

    Fixes: ca289e0b95af ("fsdax: switch dax_iomap_rw to use iomap_iter")
    Signed-off-by: Li Jinlin
    Reviewed-by: Darrick J. Wong
    Link: https://lore.kernel.org/r/20220725032050.3873372-1-lijinlin3@huawei.com
    Signed-off-by: Dan Williams
    Signed-off-by: Sasha Levin

    Li Jinlin
     
  • [ Upstream commit 1eb70f54c445fcbb25817841e774adb3d912f3e8 ]

    xfs_repair catches fork size/format mismatches, but the in-kernel
    verifier doesn't, leading to null pointer failures when attempting
    to perform operations on the fork. This can occur in
    xfs_dir_is_empty(), where the in-memory fork format does not match
    the size and so the fork data pointer is accessed incorrectly.

    Note: this causes new failures in xfs/348, which tests mode vs
    ftype mismatches. We now detect a regular file that has been changed
    to a directory or symlink mode as corrupt, because the data fork for
    a symlink or directory should be in local form when there are only
    3 bytes of data in the data fork. Hence the inode verifier for the
    regular file now fires with -EFSCORRUPTED, because the inode fork
    format does not match the format the corrupted mode says it should
    be in.

    Signed-off-by: Dave Chinner
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner
    Signed-off-by: Leah Rumancik
    Acked-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • [ Upstream commit 6f5097e3367a7c0751e165e4c15bc30511a4ba38 ]

    For some reason commit 9a5280b312e2e ("xfs: reorder iunlink remove
    operation in xfs_ifree") replaced a jump to the exit path in the
    event of an xfs_difree() error with a direct return, which skips
    releasing the perag reference acquired at the top of the function.
    Restore the original code to drop the reference on error.
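    The leak pattern can be shown with a small userspace sketch (names and the error value are hypothetical, not the actual xfs_ifree() code):

```c
#include <assert.h>

int pag_refcount;	/* stands in for the perag reference count */

/* The error path must go through the same exit label that drops the
 * perag reference taken at the top of the function; a bare
 * `return error;` in the error branch leaks the reference. */
int ifree_sketch(int difree_fails)
{
	int error = 0;

	pag_refcount++;		/* reference acquired at the top */
	if (difree_fails) {
		error = -117;	/* say, -EFSCORRUPTED */
		goto out;	/* the fix: jump to the exit path */
	}
	/* ... free the inode ... */
out:
	pag_refcount--;		/* reference released on every path */
	return error;
}
```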

    Fixes: 9a5280b312e2e ("xfs: reorder iunlink remove operation in xfs_ifree")
    Signed-off-by: Brian Foster
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Dave Chinner
    Signed-off-by: Leah Rumancik
    Acked-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Brian Foster
     
  • [ Upstream commit 9a5280b312e2e7898b6397b2ca3cfd03f67d7be1 ]

    The O_TMPFILE creation implementation creates a specific order of
    operations for inode allocation/freeing and unlinked list
    modification. Currently both are serialised by the AGI, so the order
    doesn't strictly matter as long as they are both in the same
    transaction.

    However, if we want to move the unlinked list insertions largely out
    from under the AGI lock, then we have to be concerned about the
    order in which we do unlinked list modification operations.
    O_TMPFILE creation tells us this order is inode allocation/free,
    then unlinked list modification.

    Change xfs_ifree() to use this same ordering on unlinked list
    removal. This way we always guarantee that when we enter the
    iunlinked list removal code from this path, we already have the AGI
    locked, and we don't have to worry about nesting AGI reads inside
    unlinked list locks, because the AGI is already locked and attached
    to the transaction.

    We can do this safely as the inode freeing and unlinked list removal
    are done in the same transaction and hence are atomic operations
    with respect to log recovery.

    Reported-by: Frank Hofmann
    Fixes: 298f7bec503f ("xfs: pin inode backing buffer to the inode log item")
    Signed-off-by: Dave Chinner
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner
    Signed-off-by: Leah Rumancik
    Acked-by: Darrick J. Wong
    Signed-off-by: Greg Kroah-Hartman

    Dave Chinner
     
  • commit a362bb864b8db4861977d00bd2c3222503ccc34b upstream.

    Often when running generic/562 from fstests we can hang during unmount,
    resulting in a trace like this:

    Sep 07 11:52:00 debian9 unknown: run fstests generic/562 at 2022-09-07 11:52:00
    Sep 07 11:55:32 debian9 kernel: INFO: task umount:49438 blocked for more than 120 seconds.
    Sep 07 11:55:32 debian9 kernel: Not tainted 6.0.0-rc2-btrfs-next-122 #1
    Sep 07 11:55:32 debian9 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Sep 07 11:55:32 debian9 kernel: task:umount state:D stack: 0 pid:49438 ppid: 25683 flags:0x00004000
    Sep 07 11:55:32 debian9 kernel: Call Trace:
    Sep 07 11:55:32 debian9 kernel:
    Sep 07 11:55:32 debian9 kernel: __schedule+0x3c8/0xec0
    Sep 07 11:55:32 debian9 kernel: ? rcu_read_lock_sched_held+0x12/0x70
    Sep 07 11:55:32 debian9 kernel: schedule+0x5d/0xf0
    Sep 07 11:55:32 debian9 kernel: schedule_timeout+0xf1/0x130
    Sep 07 11:55:32 debian9 kernel: ? lock_release+0x224/0x4a0
    Sep 07 11:55:32 debian9 kernel: ? lock_acquired+0x1a0/0x420
    Sep 07 11:55:32 debian9 kernel: ? trace_hardirqs_on+0x2c/0xd0
    Sep 07 11:55:32 debian9 kernel: __wait_for_common+0xac/0x200
    Sep 07 11:55:32 debian9 kernel: ? usleep_range_state+0xb0/0xb0
    Sep 07 11:55:32 debian9 kernel: __flush_work+0x26d/0x530
    Sep 07 11:55:32 debian9 kernel: ? flush_workqueue_prep_pwqs+0x140/0x140
    Sep 07 11:55:32 debian9 kernel: ? trace_clock_local+0xc/0x30
    Sep 07 11:55:32 debian9 kernel: __cancel_work_timer+0x11f/0x1b0
    Sep 07 11:55:32 debian9 kernel: ? close_ctree+0x12b/0x5b3 [btrfs]
    Sep 07 11:55:32 debian9 kernel: ? __trace_bputs+0x10b/0x170
    Sep 07 11:55:32 debian9 kernel: close_ctree+0x152/0x5b3 [btrfs]
    Sep 07 11:55:32 debian9 kernel: ? evict_inodes+0x166/0x1c0
    Sep 07 11:55:32 debian9 kernel: generic_shutdown_super+0x71/0x120
    Sep 07 11:55:32 debian9 kernel: kill_anon_super+0x14/0x30
    Sep 07 11:55:32 debian9 kernel: btrfs_kill_super+0x12/0x20 [btrfs]
    Sep 07 11:55:32 debian9 kernel: deactivate_locked_super+0x2e/0xa0
    Sep 07 11:55:32 debian9 kernel: cleanup_mnt+0x100/0x160
    Sep 07 11:55:32 debian9 kernel: task_work_run+0x59/0xa0
    Sep 07 11:55:32 debian9 kernel: exit_to_user_mode_prepare+0x1a6/0x1b0
    Sep 07 11:55:32 debian9 kernel: syscall_exit_to_user_mode+0x16/0x40
    Sep 07 11:55:32 debian9 kernel: do_syscall_64+0x48/0x90
    Sep 07 11:55:32 debian9 kernel: entry_SYSCALL_64_after_hwframe+0x63/0xcd
    Sep 07 11:55:32 debian9 kernel: RIP: 0033:0x7fcde59a57a7
    Sep 07 11:55:32 debian9 kernel: RSP: 002b:00007ffe914217c8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    Sep 07 11:55:32 debian9 kernel: RAX: 0000000000000000 RBX: 00007fcde5ae8264 RCX: 00007fcde59a57a7
    Sep 07 11:55:32 debian9 kernel: RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000055b57556cdd0
    Sep 07 11:55:32 debian9 kernel: RBP: 000055b57556cba0 R08: 0000000000000000 R09: 00007ffe91420570
    Sep 07 11:55:32 debian9 kernel: R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    Sep 07 11:55:32 debian9 kernel: R13: 000055b57556cdd0 R14: 000055b57556ccb8 R15: 0000000000000000
    Sep 07 11:55:32 debian9 kernel:

    What happens is the following:

    1) The cleaner kthread tries to start a transaction to delete an unused
    block group, but the metadata reservation can not be satisfied right
    away, so a reservation ticket is created and it starts the async
    metadata reclaim task (fs_info->async_reclaim_work);

    2) Writeback for all the filler inodes with an i_size of 2K starts
    (generic/562 creates a lot of 2K files with the goal of filling
    metadata space). We try to create an inline extent for them, but we
    fail when trying to insert the inline extent with -ENOSPC (at
    cow_file_range_inline()) - since this is not critical, we fallback
    to non-inline mode (back to cow_file_range()), reserve extents, create
    extent maps and create the ordered extents;

    3) An unmount starts, enters close_ctree();

    4) The async reclaim task is flushing stuff, entering the flush states one
    by one, until it reaches RUN_DELAYED_IPUTS. There it runs all current
    delayed iputs.

    After running the delayed iputs and before calling
    btrfs_wait_on_delayed_iputs(), one or more ordered extents complete,
    and btrfs_add_delayed_iput() is called for each one through
    btrfs_finish_ordered_io() -> btrfs_put_ordered_extent(). This results
    in bumping fs_info->nr_delayed_iputs from 0 to some positive value.

    So the async reclaim task blocks at btrfs_wait_on_delayed_iputs() waiting
    for fs_info->nr_delayed_iputs to become 0;

    5) The current transaction is committed by the transaction kthread, we then
    start unpinning extents and end up calling btrfs_try_granting_tickets()
    through unpin_extent_range(), since we released some space.
    This results in satisfying the ticket created by the cleaner kthread at
    step 1, waking up the cleaner kthread;

    6) At close_ctree() we ask the cleaner kthread to park;

    7) The cleaner kthread starts the transaction, deletes the unused block
    group, and then calls kthread_should_park(), which returns true, so it
    parks. And at this point we have the delayed iputs added by the
    completion of the ordered extents still pending;

    8) Then later at close_ctree(), when we call:

    cancel_work_sync(&fs_info->async_reclaim_work);

    We hang forever, since the cleaner was parked and no one else can run
    delayed iputs after that, while the reclaim task is waiting for the
    remaining delayed iputs to be completed.

    Fix this by waiting for all ordered extents to complete and running the
    delayed iputs before attempting to stop the async reclaim tasks. Note that
    we can not wait for ordered extents with btrfs_wait_ordered_roots() (or
    other similar functions) because that waits for the BTRFS_ORDERED_COMPLETE
    flag to be set on an ordered extent, but the delayed iput is added after
    that, when doing the final btrfs_put_ordered_extent(). So instead wait for
    the work queues used for executing ordered extent completion to be empty,
    which works because we do the final put on an ordered extent at
    btrfs_finish_ordered_io() (while we are in the unmount context).

    Fixes: d6fd0ae25c6495 ("Btrfs: fix missing delayed iputs on unmount")
    CC: stable@vger.kernel.org # 5.15+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 8a1f1e3d1eecf9d2359a2709e276743a67e145db upstream.

    During early unmount, at close_ctree(), we try to stop the block group
    reclaim task with cancel_work_sync(), but that may hang if the block group
    reclaim task is currently at btrfs_relocate_block_group() waiting for the
    flag BTRFS_FS_UNFINISHED_DROPS to be cleared from fs_info->flags. During
    unmount we only clear that flag later, after trying to stop the block
    group reclaim task.

    Fix that by clearing BTRFS_FS_UNFINISHED_DROPS before trying to stop the
    block group reclaim task and after setting BTRFS_FS_CLOSING_START, so that
    if the reclaim task is waiting on that bit, it will stop immediately after
    being woken, because it sees the filesystem is closing (with a call to
    btrfs_fs_closing()), and then returns immediately with -EINTR.

    Fixes: 31e70e527806c5 ("btrfs: fix hang during unmount when block group reclaim task is running")
    CC: stable@vger.kernel.org # 5.15+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

23 Sep, 2022

5 commits

  • [ Upstream commit 0066f1b0e27556381402db3ff31f85d2a2265858 ]

    When trying to get a file lock on an AFS file, the server may return
    UAEAGAIN to indicate that the lock is already held. This is currently
    translated by the default path to -EREMOTEIO.

    Translate it instead to -EAGAIN so that we know we can retry it.
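    The translation can be sketched like this (simplified; the numeric abort-code value here is hypothetical, not the real AFS constant):

```c
#include <assert.h>
#include <errno.h>

#define UAEAGAIN_SKETCH 1001	/* hypothetical stand-in value */

/* Map the server's "lock already held" abort code to -EAGAIN so
 * callers know the operation is retryable, instead of falling through
 * to the generic -EREMOTEIO translation. */
int abort_to_errno(int abort_code)
{
	switch (abort_code) {
	case UAEAGAIN_SKETCH:
		return -EAGAIN;		/* lock busy: retry later */
	default:
		return -EREMOTEIO;	/* unrecognised abort code */
	}
}
```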

    Signed-off-by: David Howells
    Reviewed-by: Jeffrey E Altman
    cc: Marc Dionne
    cc: linux-afs@lists.infradead.org
    Link: https://lore.kernel.org/r/166075761334.3533338.2591992675160918098.stgit@warthog.procyon.org.uk/
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    David Howells
     
  • commit bedc8f76b3539ac4f952114b316bcc2251e808ce upstream.

    So far we have just been lucky that the uninitialized members
    of struct msghdr are not used by default on a SOCK_STREAM TCP
    socket.

    But as new things like msg_ubuf and sg_from_iter were added
    recently, we should err on the safe side and avoid potential
    problems in the future.

    Signed-off-by: Stefan Metzmacher
    Cc: stable@vger.kernel.org
    Reviewed-by: Paulo Alcantara (SUSE)
    Reviewed-by: Ronnie Sahlberg
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Stefan Metzmacher
     
  • commit 17d3df38dc5f4cec9b0ac6eb79c1859b6e2693a4 upstream.

    This is ignored by the TCP layer anyway.

    Signed-off-by: Stefan Metzmacher
    Cc: stable@vger.kernel.org
    Reviewed-by: Ronnie Sahlberg
    Reviewed-by: Paulo Alcantara (SUSE)
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Stefan Metzmacher
     
  • commit 7500a99281dfed2d4a84771c933bcb9e17af279b upstream.

    Kernel bugzilla: 216301

    When doing direct writes we need to also invalidate the mapping, in
    case we have a cached copy of the affected page(s) in memory;
    otherwise subsequent reads of the data might return the old/stale
    content before we wrote an update to the server.

    Cc: stable@vger.kernel.org
    Reviewed-by: Paulo Alcantara (SUSE)
    Signed-off-by: Ronnie Sahlberg
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Ronnie Sahlberg
     
  • [ Upstream commit 2a9d683b48c8a87e61a4215792d44c90bcbbb536 ]

    The NFSv4.0 protocol only supports open() by name. It cannot therefore
    be used with open_by_handle() and friends, nor can it be re-exported by
    knfsd.

    Reported-by: Chuck Lever III
    Fixes: 20fa19027286 ("nfs: add export operations")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Trond Myklebust
     

20 Sep, 2022

1 commit

  • [ Upstream commit 47311db8e8f33011d90dee76b39c8886120cdda4 ]

    Users may have explicitly configured their tracefs permissions; we
    shouldn't overwrite those just because a second mount appeared.

    Only clobber if the options were provided at mount time.

    Note: the previous behavior was especially surprising in the presence of
    automounted /sys/kernel/debug/tracing/.
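    The intended behavior can be sketched as follows (the structure and field names are hypothetical, not the real tracefs code):

```c
#include <assert.h>
#include <stdbool.h>

struct tracefs_opts_sketch {
	bool mode_set;		/* was mode= explicitly given at mount? */
	unsigned int mode;
};

/* Only clobber the existing permissions when the option was explicitly
 * provided at mount time; a second (auto)mount with no options must
 * leave previously configured permissions alone. */
void apply_opts_sketch(const struct tracefs_opts_sketch *opts,
		       unsigned int *inode_mode)
{
	if (opts->mode_set)
		*inode_mode = opts->mode;
}
```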

    Existing behavior:

    ## Pre-existing status: tracefs is 0755.
    # stat -c '%A' /sys/kernel/tracing/
    drwxr-xr-x

    ## (Re)trigger the automount.
    # umount /sys/kernel/debug/tracing
    # stat -c '%A' /sys/kernel/debug/tracing/.
    drwx------

    ## Unexpected: the automount changed mode for other mount instances.
    # stat -c '%A' /sys/kernel/tracing/
    drwx------

    New behavior (after this change):

    ## Pre-existing status: tracefs is 0755.
    # stat -c '%A' /sys/kernel/tracing/
    drwxr-xr-x

    ## (Re)trigger the automount.
    # umount /sys/kernel/debug/tracing
    # stat -c '%A' /sys/kernel/debug/tracing/.
    drwxr-xr-x

    ## Expected: the automount does not change other mount instances.
    # stat -c '%A' /sys/kernel/tracing/
    drwxr-xr-x

    Link: https://lkml.kernel.org/r/20220826174353.2.Iab6e5ea57963d6deca5311b27fb7226790d44406@changeid

    Cc: stable@vger.kernel.org
    Fixes: 4282d60689d4f ("tracefs: Add new tracefs file system")
    Signed-off-by: Brian Norris
    Signed-off-by: Steven Rostedt (Google)
    Signed-off-by: Sasha Levin

    Brian Norris
     

15 Sep, 2022

7 commits

  • [ Upstream commit 2f44013e39984c127c6efedf70e6b5f4e9dcf315 ]

    During stress testing with CONFIG_SMP disabled, KASAN reports as below:

    ==================================================================
    BUG: KASAN: use-after-free in __mutex_lock+0xe5/0xc30
    Read of size 8 at addr ffff8881094223f8 by task stress/7789

    CPU: 0 PID: 7789 Comm: stress Not tainted 6.0.0-rc1-00002-g0d53d2e882f9 #3
    Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
    Call Trace:

    ..
    __mutex_lock+0xe5/0xc30
    ..
    z_erofs_do_read_page+0x8ce/0x1560
    ..
    z_erofs_readahead+0x31c/0x580
    ..
    Freed by task 7787
    kasan_save_stack+0x1e/0x40
    kasan_set_track+0x20/0x30
    kasan_set_free_info+0x20/0x40
    __kasan_slab_free+0x10c/0x190
    kmem_cache_free+0xed/0x380
    rcu_core+0x3d5/0xc90
    __do_softirq+0x12d/0x389

    Last potentially related work creation:
    kasan_save_stack+0x1e/0x40
    __kasan_record_aux_stack+0x97/0xb0
    call_rcu+0x3d/0x3f0
    erofs_shrink_workstation+0x11f/0x210
    erofs_shrink_scan+0xdc/0x170
    shrink_slab.constprop.0+0x296/0x530
    drop_slab+0x1c/0x70
    drop_caches_sysctl_handler+0x70/0x80
    proc_sys_call_handler+0x20a/0x2f0
    vfs_write+0x555/0x6c0
    ksys_write+0xbe/0x160
    do_syscall_64+0x3b/0x90

    The root cause is that erofs_workgroup_unfreeze() doesn't reset the
    counter to orig_val, which causes a race where the pcluster is
    unexpectedly reused before being freed.

    Since UP platforms are quite rare now, such a path has become
    unnecessary. Let's just drop this specially-designed path instead.

    Fixes: 73f5c66df3e2 ("staging: erofs: fix `erofs_workgroup_{try_to_freeze, unfreeze}'")
    Reviewed-by: Yue Hu
    Reviewed-by: Chao Yu
    Link: https://lore.kernel.org/r/20220902045710.109530-1-hsiangkao@linux.alibaba.com
    Signed-off-by: Gao Xiang
    Signed-off-by: Sasha Levin

    Gao Xiang
     
  • [ Upstream commit 7903192c4b4a82d792cb0dc5e2779a2efe60d45b ]

    rxrpc and kafs between them try to use the receive timestamp on the first
    data packet (ie. the one with sequence number 1) as a base from which to
    calculate the time at which callback promise and lock expiration occurs.

    However, we don't know how long it took for the server to send us the reply
    from it having completed the basic part of the operation - it might then,
    for instance, have to send a bunch of a callback breaks, depending on the
    particular operation.

    Fix this by using the time at which the operation is issued on the client
    as a base instead. That should never be longer than the server's idea of
    the expiry time.

    Fixes: 781070551c26 ("afs: Fix calculation of callback expiry time")
    Fixes: 2070a3e44962 ("rxrpc: Allow the reply time to be obtained on a client call")
    Suggested-by: Jeffrey E Altman
    Signed-off-by: David Howells
    Signed-off-by: Sasha Levin

    David Howells
     
  • [ Upstream commit 67f4b5dc49913abcdb5cc736e73674e2f352f81d ]

    Currently, when the writeback code detects a server reboot, it redirties
    any pages that were not committed to disk, and it sets the flag
    NFS_CONTEXT_RESEND_WRITES in the nfs_open_context of the file descriptor
    that dirtied the file. While this allows the file descriptor in question
    to redrive its own writes, it violates the fsync() requirement that we
    should be synchronising all writes to disk.
    While the problem is infrequent, we do see corner cases where an
    untimely server reboot causes the fsync() call to abandon its attempt to
    sync data to disk, causing data corruption issues due to missed error
    conditions or similar.

    In order to tighten up the client's ability to deal with this situation
    without introducing livelocks, add a counter that records the number of
    times pages are redirtied due to a server reboot-like condition, and use
    that in fsync() to redrive the sync to disk.

    Fixes: 2197e9b06c22 ("NFS: Fix up fsync() when the server rebooted")
    Cc: stable@vger.kernel.org
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit e591b298d7ecb851e200f65946e3d53fe78a3c4f ]

    Save some space in the nfs_inode by setting up an anonymous union with
    the fields that are peculiar to a specific type of filesystem object.

    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit ff81dfb5d721fff87bd516c558847f6effb70031 ]

    If a user is doing 'ls -l', we have a heuristic in GETATTR that tells
    the readdir code to try to use READDIRPLUS in order to refresh the inode
    attributes. In certain circumstances, we also try to invalidate the
    remaining directory entries in order to ensure this refresh.

    If there are multiple readers of the directory, we probably should avoid
    invalidating the page cache, since the heuristic breaks down in that
    situation anyway.

    Signed-off-by: Trond Myklebust
    Tested-by: Benjamin Coddington
    Reviewed-by: Benjamin Coddington
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • commit dec9b2f1e0455a151a7293c367da22ab973f713e upstream.

    There is a very common pattern of using
    debugfs_remove(debugfs_lookup(..)) which results in a leak of the
    dentry that was looked up. Instead of having to open-code the correct
    pattern of calling dput() on the dentry, create
    debugfs_lookup_and_remove() to handle this pattern automatically and
    properly without any memory leaks.

    Cc: stable
    Reported-by: Kuyo Chang
    Tested-by: Kuyo Chang
    Link: https://lore.kernel.org/r/YxIaQ8cSinDR881k@kroah.com
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • commit cac5c44c48c9fb9cc31bea15ebd9ef0c6462314f upstream.

    The commit 7d7672bc5d10 ("btrfs: convert count_max_extents() to use
    fs_info->max_extent_size") introduced a division by
    fs_info->max_extent_size. This max_extent_size is initialized with max
    zone append limit size of the device btrfs runs on. However, in zone
    emulation mode, the device is not zoned, so its zone append limit is
    zero. This resulted in a zero value of fs_info->max_extent_size and
    caused a division-by-zero error.

    Fix the error by setting a non-zero pseudo value as the max zone append
    limit in zone emulation mode. Set the pseudo value based on max_segments as
    suggested in the commit c2ae7b772ef4 ("btrfs: zoned: revive
    max_zone_append_bytes").

    Fixes: 7d7672bc5d10 ("btrfs: convert count_max_extents() to use fs_info->max_extent_size")
    CC: stable@vger.kernel.org # 5.12+
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Naohiro Aota
    Signed-off-by: Shin'ichiro Kawasaki
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Shin'ichiro Kawasaki
     

08 Sep, 2022

1 commit

  • commit 27893dfc1285f80f80f46b3b8c95f5d15d2e66d0 upstream.

    In some failure cases (dialect mismatches) in SMB2_negotiate(), after
    the request is sent, the checks would return -EIO directly when they
    should instead set rc = -EIO and jump to neg_exit to free the response
    buffer from the mempool.

    Signed-off-by: Enzo Matsumiya
    Cc: stable@vger.kernel.org
    Reviewed-by: Ronnie Sahlberg
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Enzo Matsumiya
     

05 Sep, 2022

11 commits

  • commit ced8ecf026fd8084cf175530ff85c76d6085d715 upstream.

    When testing space_cache v2 on a large set of machines, we encountered a
    few symptoms:

    1. "unable to add free space :-17" (EEXIST) errors.
    2. Missing free space info items, sometimes caught with a "missing free
    space info for X" error.
    3. Double-accounted space: ranges that were allocated in the extent tree
    and also marked as free in the free space tree, ranges that were
    marked as allocated twice in the extent tree, or ranges that were
    marked as free twice in the free space tree. If the latter made it
    onto disk, the next reboot would hit the BUG_ON() in
    add_new_free_space().
    4. On some hosts with no on-disk corruption or error messages, the
    in-memory space cache (dumped with drgn) disagreed with the free
    space tree.

    All of these symptoms have the same underlying cause: a race between
    caching the free space for a block group and returning free space to the
    in-memory space cache for pinned extents causes us to double-add a free
    range to the space cache. This race exists when free space is cached
    from the free space tree (space_cache=v2) or the extent tree
    (nospace_cache, or space_cache=v1 if the cache needs to be regenerated).
    struct btrfs_block_group::last_byte_to_unpin and struct
    btrfs_block_group::progress are supposed to protect against this race,
    but commit d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when
    waiting for a transaction commit") subtly broke this by allowing
    multiple transactions to be unpinning extents at the same time.

    Specifically, the race is as follows:

    1. An extent is deleted from an uncached block group in transaction A.
    2. btrfs_commit_transaction() is called for transaction A.
    3. btrfs_run_delayed_refs() -> __btrfs_free_extent() runs the delayed
    ref for the deleted extent.
    4. __btrfs_free_extent() -> do_free_extent_accounting() ->
    add_to_free_space_tree() adds the deleted extent back to the free
    space tree.
    5. do_free_extent_accounting() -> btrfs_update_block_group() ->
    btrfs_cache_block_group() queues up the block group to get cached.
    block_group->progress is set to block_group->start.
    6. btrfs_commit_transaction() for transaction A calls
    switch_commit_roots(). It sets block_group->last_byte_to_unpin to
    block_group->progress, which is block_group->start because the block
    group hasn't been cached yet.
    7. The caching thread gets to our block group. Since the commit roots
    were already switched, load_free_space_tree() sees the deleted extent
    as free and adds it to the space cache. It finishes caching and sets
    block_group->progress to U64_MAX.
    8. btrfs_commit_transaction() advances transaction A to
    TRANS_STATE_SUPER_COMMITTED.
    9. fsync calls btrfs_commit_transaction() for transaction B. Since
    transaction A is already in TRANS_STATE_SUPER_COMMITTED and the
    commit is for fsync, it advances.
    10. btrfs_commit_transaction() for transaction B calls
    switch_commit_roots(). This time, the block group has already been
    cached, so it sets block_group->last_byte_to_unpin to U64_MAX.
    11. btrfs_commit_transaction() for transaction A calls
    btrfs_finish_extent_commit(), which calls unpin_extent_range() for
    the deleted extent. It sees last_byte_to_unpin set to U64_MAX (by
    transaction B!), so it adds the deleted extent to the space cache
    again!

    This explains all of our symptoms above:

    * If the sequence of events is exactly as described above, when the free
    space is re-added in step 11, it will fail with EEXIST.
    * If another thread reallocates the deleted extent in between steps 7
    and 11, then step 11 will silently re-add that space to the space
    cache as free even though it is actually allocated. Then, if that
    space is allocated *again*, the free space tree will be corrupted
    (namely, the wrong item will be deleted).
    * If we don't catch this free space tree corruption, it will continue
    to get worse as extents are deleted and reallocated.

    The v1 space_cache is synchronously loaded when an extent is deleted
    (btrfs_update_block_group() with alloc=0 calls btrfs_cache_block_group()
    with load_cache_only=1), so it is not normally affected by this bug.
    However, as noted above, if we fail to load the space cache, we will
    fall back to caching from the extent tree and may hit this bug.

    The easiest fix for this race is to also make caching from the free
    space tree or extent tree synchronous. Josef tested this and found no
    performance regressions.

    A few extra changes fall out of this change. Namely, this fix does the
    following, with step 2 being the crucial fix:

    1. Factor btrfs_caching_ctl_wait_done() out of
    btrfs_wait_block_group_cache_done() to allow waiting on a caching_ctl
    that we already hold a reference to.
    2. Change the call in btrfs_cache_block_group() of
    btrfs_wait_space_cache_v1_finished() to
    btrfs_caching_ctl_wait_done(), which makes us wait regardless of the
    space_cache option.
    3. Delete the now unused btrfs_wait_space_cache_v1_finished() and
    space_cache_v1_done().
    4. Change btrfs_cache_block_group()'s `int load_cache_only` parameter to
    `bool wait` to more accurately describe its new meaning.
    5. Change a few callers which had a separate call to
    btrfs_wait_block_group_cache_done() to use wait = true instead.
    6. Make btrfs_wait_block_group_cache_done() static now that it's not
    used outside of block-group.c anymore.

    Fixes: d0c2f4fa555e ("btrfs: make concurrent fsyncs wait less when waiting for a transaction commit")
    CC: stable@vger.kernel.org # 5.12+
    Reviewed-by: Filipe Manana
    Signed-off-by: Omar Sandoval
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Omar Sandoval
     
  • [ Upstream commit 899b7f69f244e539ea5df1b4d756046337de44a5 ]

    We're seeing a weird problem in production where we have overlapping
    extent items in the extent tree. It's unclear where these are coming
    from, and in debugging we realized there's no check in the tree checker
    for this sort of problem. Add a check to the tree-checker to make sure
    that the extents do not overlap each other.

    Reviewed-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit b40130b23ca4a08c5785d5a3559805916bddba3c ]

    We have been hitting the following lockdep splat with btrfs/187 recently

    WARNING: possible circular locking dependency detected
    5.19.0-rc8+ #775 Not tainted
    ------------------------------------------------------
    btrfs/752500 is trying to acquire lock:
    ffff97e1875a97b8 (btrfs-treloc-02#2){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

    but task is already holding lock:
    ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #2 (btrfs-tree-01/1){+.+.}-{3:3}:
    down_write_nested+0x41/0x80
    __btrfs_tree_lock+0x24/0x110
    btrfs_init_new_buffer+0x7d/0x2c0
    btrfs_alloc_tree_block+0x120/0x3b0
    __btrfs_cow_block+0x136/0x600
    btrfs_cow_block+0x10b/0x230
    btrfs_search_slot+0x53b/0xb70
    btrfs_lookup_inode+0x2a/0xa0
    __btrfs_update_delayed_inode+0x5f/0x280
    btrfs_async_run_delayed_root+0x24c/0x290
    btrfs_work_helper+0xf2/0x3e0
    process_one_work+0x271/0x590
    worker_thread+0x52/0x3b0
    kthread+0xf0/0x120
    ret_from_fork+0x1f/0x30

    -> #1 (btrfs-tree-01){++++}-{3:3}:
    down_write_nested+0x41/0x80
    __btrfs_tree_lock+0x24/0x110
    btrfs_search_slot+0x3c3/0xb70
    do_relocation+0x10c/0x6b0
    relocate_tree_blocks+0x317/0x6d0
    relocate_block_group+0x1f1/0x560
    btrfs_relocate_block_group+0x23e/0x400
    btrfs_relocate_chunk+0x4c/0x140
    btrfs_balance+0x755/0xe40
    btrfs_ioctl+0x1ea2/0x2c90
    __x64_sys_ioctl+0x88/0xc0
    do_syscall_64+0x38/0x90
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    -> #0 (btrfs-treloc-02#2){+.+.}-{3:3}:
    __lock_acquire+0x1122/0x1e10
    lock_acquire+0xc2/0x2d0
    down_write_nested+0x41/0x80
    __btrfs_tree_lock+0x24/0x110
    btrfs_lock_root_node+0x31/0x50
    btrfs_search_slot+0x1cb/0xb70
    replace_path+0x541/0x9f0
    merge_reloc_root+0x1d6/0x610
    merge_reloc_roots+0xe2/0x260
    relocate_block_group+0x2c8/0x560
    btrfs_relocate_block_group+0x23e/0x400
    btrfs_relocate_chunk+0x4c/0x140
    btrfs_balance+0x755/0xe40
    btrfs_ioctl+0x1ea2/0x2c90
    __x64_sys_ioctl+0x88/0xc0
    do_syscall_64+0x38/0x90
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    other info that might help us debug this:

    Chain exists of:
    btrfs-treloc-02#2 --> btrfs-tree-01 --> btrfs-tree-01/1

    Possible unsafe locking scenario:

    CPU0                          CPU1
    ----                          ----
    lock(btrfs-tree-01/1);
                                  lock(btrfs-tree-01);
                                  lock(btrfs-tree-01/1);
    lock(btrfs-treloc-02#2);

    *** DEADLOCK ***

    7 locks held by btrfs/752500:
    #0: ffff97e292fdf460 (sb_writers#12){.+.+}-{0:0}, at: btrfs_ioctl+0x208/0x2c90
    #1: ffff97e284c02050 (&fs_info->reclaim_bgs_lock){+.+.}-{3:3}, at: btrfs_balance+0x55f/0xe40
    #2: ffff97e284c00878 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_relocate_block_group+0x236/0x400
    #3: ffff97e292fdf650 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xef/0x610
    #4: ffff97e284c02378 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
    #5: ffff97e284c023a0 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x1a8/0x5a0
    #6: ffff97e1875a9278 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x24/0x110

    stack backtrace:
    CPU: 1 PID: 752500 Comm: btrfs Not tainted 5.19.0-rc8+ #775
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
    Call Trace:

    dump_stack_lvl+0x56/0x73
    check_noncircular+0xd6/0x100
    ? lock_is_held_type+0xe2/0x140
    __lock_acquire+0x1122/0x1e10
    lock_acquire+0xc2/0x2d0
    ? __btrfs_tree_lock+0x24/0x110
    down_write_nested+0x41/0x80
    ? __btrfs_tree_lock+0x24/0x110
    __btrfs_tree_lock+0x24/0x110
    btrfs_lock_root_node+0x31/0x50
    btrfs_search_slot+0x1cb/0xb70
    ? lock_release+0x137/0x2d0
    ? _raw_spin_unlock+0x29/0x50
    ? release_extent_buffer+0x128/0x180
    replace_path+0x541/0x9f0
    merge_reloc_root+0x1d6/0x610
    merge_reloc_roots+0xe2/0x260
    relocate_block_group+0x2c8/0x560
    btrfs_relocate_block_group+0x23e/0x400
    btrfs_relocate_chunk+0x4c/0x140
    btrfs_balance+0x755/0xe40
    btrfs_ioctl+0x1ea2/0x2c90
    ? lock_is_held_type+0xe2/0x140
    ? lock_is_held_type+0xe2/0x140
    ? __x64_sys_ioctl+0x88/0xc0
    __x64_sys_ioctl+0x88/0xc0
    do_syscall_64+0x38/0x90
    entry_SYSCALL_64_after_hwframe+0x63/0xcd

    This isn't necessarily new, it's just tricky to hit in practice. There
    are two competing things going on here. With relocation we create a
    snapshot of every fs tree with a reloc tree. Any extent buffers that
    get initialized here are initialized with the reloc root lockdep key.
    However since it is a snapshot, any blocks that are currently in cache
    that originally belonged to the fs tree will have the normal tree
    lockdep key set. This creates the lock dependency of

    reloc tree -> normal tree

    for the extent buffer locking during the first phase of the relocation
    as we walk down the reloc root to relocate blocks.

    However this is problematic because the final phase of the relocation is
    merging the reloc root into the original fs root. This involves
    searching down to any keys that exist in the original fs root and then
    swapping the relocated block and the original fs root block. We have to
    search down to the fs root first, and then go search the reloc root for
    the block we need to replace. This creates the dependency of

    normal tree -> reloc tree

    which is why lockdep complains.

    Additionally even if we were to fix this particular mismatch with a
    different nesting for the merge case, we're still slotting in a block
    that has an owner of the reloc root objectid into a normal tree, so that
    block will have its lockdep key set to the tree reloc root, and create a
    lockdep splat later on when we wander into that block from the fs root.

    Unfortunately the only solution here is to make sure we do not set the
    lockdep key to the reloc tree lockdep key normally, and then reset any
    blocks we wander into from the reloc root when we're doing the merge.

    This solves the problem of having mixed tree reloc keys intermixed with
    normal tree keys, and then allows us to make sure in the merge case we
    maintain the lock order of

    normal tree -> reloc tree

    We handle this by setting a bit on the reloc root when we do the search
    for the block we want to relocate, and any block we search into or COW
    at that point gets set to the reloc tree key. This works correctly
    because we only ever COW down to the parent node, so we aren't resetting
    the key for the block we're linking into the fs root.

    With this patch we no longer have the lockdep splat in btrfs/187.

    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 0a27a0474d146eb79e09ec88bf0d4229f4cfc1b8 ]

    These definitions exist in disk-io.c, which is not related to the
    locking. Move this over to locking.h/c where it makes more sense.

    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 17661ecf6a64eb11ae7f1108fe88686388b2acd5 ]

    When an SMB client opens a file in a ksmbd share with O_TRUNC, the dos
    attribute xattr is removed along with the data in the file. This causes
    the subsequent FSCTL_SET_SPARSE request from the client to fail because
    ksmbd can't update the dos attribute after setting ATTR_SPARSE_FILE.
    This patch also fixes the xfstests generic/469 test.

    Signed-off-by: Namjae Jeon
    Reviewed-by: Hyunchul Lee
    Signed-off-by: Steve French
    Signed-off-by: Sasha Levin

    Namjae Jeon
     
  • [ Upstream commit fe54833dc8d97ef387e86f7c80537d51c503ca75 ]

    If share is not configured in smb.conf, smb2 tree connect should return
    STATUS_BAD_NETWORK_NAME instead of STATUS_BAD_NETWORK_PATH.

    Signed-off-by: Namjae Jeon
    Reviewed-by: Hyunchul Lee
    Signed-off-by: Steve French
    Signed-off-by: Sasha Levin

    Namjae Jeon
     
  • [ Upstream commit 42f86b1226a42bfc79a7125af435432ad4680a32 ]

    In some cases the xattr is too fragmented, so we need to load it
    before writing.

    Signed-off-by: Konstantin Komarov
    Signed-off-by: Sasha Levin

    Konstantin Komarov
     
  • [ Upstream commit 769030e11847c5412270c0726ff21d3a1f0a3131 ]

    During log replay, at add_link(), we may increment the link count of
    another inode that has a reference that conflicts with a new reference
    for the inode currently being processed.

    During log replay, at add_link(), we may drop (unlink) a reference from
    some inode in the subvolume tree if that reference conflicts with a new
    reference found in the log for the inode we are currently processing.

    After the unlink, if the link count has decreased from 1 to 0, then we
    increment the link count to prevent the inode from being deleted if it's
    evicted by an iput() call, because we may have references to add to that
    inode later on (and we will fixup its link count later during log replay).

    However incrementing the link count from 0 to 1 triggers a warning:

    $ cat fs/inode.c
    (...)
    void inc_nlink(struct inode *inode)
    {
            if (unlikely(inode->i_nlink == 0)) {
                    WARN_ON(!(inode->i_state & I_LINKABLE));
                    atomic_long_dec(&inode->i_sb->s_remove_count);
            }
    (...)

    The I_LINKABLE flag is only set when creating an O_TMPFILE file, so it's
    never set during log replay.

    Most of the time, the warning isn't triggered even if we dropped the last
    reference of the conflicting inode, and this is because:

    1) The conflicting inode was previously marked for fixup, through a call
    to link_to_fixup_dir(), which increments the inode's link count;

    2) And the last iput() on the inode has not triggered eviction of the
    inode, nor was eviction triggered after the iput(). So at add_link(),
    even if we unlink the last reference of the inode, its link count ends
    up being 1 and not 0.

    So this means that if eviction is triggered after link_to_fixup_dir() is
    called, at add_link() we will read the inode back from the subvolume tree
    and have it with a correct link count, matching the number of references
    it has on the subvolume tree. So if, when we are at add_link(), the
    inode has exactly one reference, its link count is 1, and after the
    unlink its link count becomes 0.

    So fix this by using set_nlink() instead of inc_nlink(), as the former
    accepts a transition from 0 to 1 and it's what we use in other similar
    contexts (like at link_to_fixup_dir()).

    Also make add_inode_ref() use set_nlink() instead of inc_nlink() to
    bump the link count from 0 to 1.

    The warning is actually harmless, but it may scare users. Josef also ran
    into it recently.

    CC: stable@vger.kernel.org # 5.1+
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit 313ab75399d0c7d0ebc718c545572c1b4d8d22ef ]

    During log replay there is this pattern of running delayed items after
    every inode unlink. To avoid repeating this several times, move the
    logic into a helper function and use it instead of calling
    btrfs_unlink_inode() followed by btrfs_run_delayed_items().

    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit ccae4a19c9140a34a0c5f0658812496dd8bbdeaf ]

    Now that we log only dir index keys when logging a directory, we no longer
    need to deal with dir item keys in the log replay code for replaying
    directory deletes. This is also true for the case when we replay a log
    tree created by a kernel that still logs dir items.

    So remove the remaining code of the replay of directory deletes algorithm
    that deals with dir item keys.

    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit 4467af8809299c12529b5c21481c1d44a3b209f9 ]

    The root argument passed to btrfs_unlink_inode() and its callee,
    __btrfs_unlink_inode(), always matches the root of the given directory and
    the given inode. So remove the argument and make __btrfs_unlink_inode()
    use the root of the directory.

    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana