29 Jul, 2020

9 commits

  • commit 0e6705182d4e1b77248a93470d6d7b3013d59b30 upstream.

    This reverts commit 9ffad9263b467efd8f8dc7ae1941a0a655a2bab2.

    Upon additional testing with older servers, it was found that
    the original commit introduced a regression when using the old SMB1
    dialect and rsyncing over an existing file.

    The patch will need to be respun to address this, likely including
    a larger refactoring of the SMB1 and SMB3 rename code paths to make
    them less confusing and also to address some additional rename error
    cases that SMB3 may be able to work around.

    Signed-off-by: Steve French
    Reported-by: Patrick Fernie
    CC: Stable
    Acked-by: Ronnie Sahlberg
    Acked-by: Pavel Shilovsky
    Acked-by: Zhang Xiaoxu
    Signed-off-by: Greg Kroah-Hartman

    Steve French
     
  • [ Upstream commit 9affa435817711861d774f5626c393c80f16d044 ]

    We hold the cl_lock here, and that's enough to keep stateids from going
    away, but it's not enough to prevent the files they point to from going
    away. Take fi_lock and a reference and check for NULL, as we do in
    other code.

    Reported-by: NeilBrown
    Fixes: 78599c42ae3c ("nfsd4: add file to display list of client's opens")
    Reviewed-by: NeilBrown
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Sasha Levin

    J. Bruce Fields
     
  • commit 5909ca110b29aa16b23b52b8de8d3bb1035fd738 upstream.

    When locking pages for delalloc, we check that each page is dirty and
    that its mapping still matches. If it does not match, we need to return
    -EAGAIN and release all pages. Only the current page was put, though;
    iterate over all the remaining pages too.
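
    The cleanup rule described above can be sketched in plain C. This is a
    userspace model, not the btrfs code: struct page_model, model_put_page()
    and model_lock_pages() are hypothetical names, and each page is assumed
    to carry exactly one reference taken by an earlier lookup.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical page model: one reference plus the two checked conditions. */
struct page_model {
    int refs;
    bool dirty;
    bool mapping_matches;
};

static void model_put_page(struct page_model *p)
{
    p->refs--;
}

/*
 * Walks pages that each hold one lookup reference. On a mismatch, drops
 * the reference on the current page AND on every page not yet visited,
 * then returns -EAGAIN (-11). Dropping only the current page would leak
 * the references on the remaining pages.
 */
static int model_lock_pages(struct page_model *pages, int nr)
{
    for (int i = 0; i < nr; i++) {
        if (!pages[i].dirty || !pages[i].mapping_matches) {
            for (; i < nr; i++)
                model_put_page(&pages[i]);
            return -11;
        }
        /* Normal path: the lookup reference is dropped after processing. */
        model_put_page(&pages[i]);
    }
    return 0;
}
```

    With three pages where the second one fails the check, every page ends
    up with a reference count of zero and the caller sees -EAGAIN.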

    CC: stable@vger.kernel.org # 4.14+
    Reviewed-by: Filipe Manana
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Robbie Ko
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Robbie Ko
     
  • commit 48cfa61b58a1fee0bc49eef04f8ccf31493b7cdd upstream.

    It is possible to cause a btrfs mount to fail by racing it with a slow
    umount. The crux of the sequence is generic_shutdown_super not yet
    calling sop->put_super before btrfs_mount_root calls btrfs_open_devices.
    If that occurs, btrfs_open_devices will decide the opened counter is
    non-zero, increment it, and skip resetting fs_devices->total_rw_bytes to
    0. From here, mount will call sget which will result in grab_super
    trying to take the super block umount semaphore. That semaphore will be
    held by the slow umount, so mount will block. Before up-ing the
    semaphore, umount will delete the super block, resulting in mount's sget
    reliably allocating a new one, which causes the mount path to dutifully
    fill it out, and increment total_rw_bytes a second time, which causes
    the mount to fail, as we see double the expected bytes.

    Here is the sequence laid out in greater detail:

    CPU0                                    CPU1
    down_write sb->s_umount
    btrfs_kill_super
      kill_anon_super(sb)
        generic_shutdown_super(sb);
          shrink_dcache_for_umount(sb);
          sync_filesystem(sb);
          evict_inodes(sb); // SLOW

                                            btrfs_mount_root
                                              btrfs_scan_one_device
                                              fs_devices = device->fs_devices
                                              fs_info->fs_devices = fs_devices
                                              // fs_devices->opened makes this a no-op
                                              btrfs_open_devices(fs_devices, mode, fs_type)
                                              s = sget(fs_type, test, set, flags, fs_info);
                                                find sb in s_instances
                                                grab_super(sb);
                                                  down_write(&s->s_umount); // blocks

          sop->put_super(sb)
            // sb->fs_devices->opened == 2; no-op
          spin_lock(&sb_lock);
          hlist_del_init(&sb->s_instances);
          spin_unlock(&sb_lock);
          up_write(&sb->s_umount);
                                                  return 0;
                                                retry lookup
                                                don't find sb in s_instances (deleted by CPU0)
                                                s = alloc_super
                                                return s;
                                              btrfs_fill_super(s, fs_devices, data)
                                                open_ctree // fs_devices total_rw_bytes improperly set!
                                                  btrfs_read_chunk_tree
                                                    read_one_dev // increment total_rw_bytes again!!
                                                    super_total_bytes < fs_devices->total_rw_bytes // ERROR!!!

    To fix this, we clear total_rw_bytes from within btrfs_read_chunk_tree
    before the calls to read_one_dev, while holding the sb umount semaphore
    and the uuid mutex.
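
    A reduced model of that fix, assuming a hypothetical struct
    fs_devices_model with a single counter; the function names only mirror
    the ones in the commit message and the locking is left out:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, reduced model of the btrfs fs_devices structure. */
struct fs_devices_model {
    uint64_t total_rw_bytes;
};

/* Models read_one_dev(): each device adds its size to the running total. */
static void model_read_one_dev(struct fs_devices_model *fsd, uint64_t dev_bytes)
{
    fsd->total_rw_bytes += dev_bytes;
}

/*
 * Models the fixed btrfs_read_chunk_tree(): the total is reset before the
 * devices are read, so a stale value left by an earlier pass cannot be
 * counted a second time.
 */
static int model_read_chunk_tree(struct fs_devices_model *fsd,
                                 const uint64_t *dev_bytes, int ndevs,
                                 uint64_t super_total_bytes)
{
    fsd->total_rw_bytes = 0;    /* the fix */
    for (int i = 0; i < ndevs; i++)
        model_read_one_dev(fsd, dev_bytes[i]);

    /* The check that failed when the total had been doubled. */
    if (super_total_bytes < fsd->total_rw_bytes)
        return -22;             /* -EINVAL */
    return 0;
}
```

    Starting from a stale total that already equals the superblock value,
    the model still ends with the correct total because of the reset.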

    To reproduce, it is sufficient to dirty a decent number of inodes, then
    quickly umount and mount.

    for i in $(seq 0 500)
    do
    dd if=/dev/zero of="/mnt/foo/$i" bs=1M count=1
    done
    umount /mnt/foo&
    mount /mnt/foo

    does the trick for me.

    CC: stable@vger.kernel.org # 4.4+
    Signed-off-by: Boris Burkov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Boris Burkov
     
  • commit 580c079b5766ac706f56eec5c79aee4bf929fef6 upstream.

    At btrfs_find_all_roots_safe() we allocate a ulist and set the **roots
    argument to point to it. However if later we fail due to an error returned
    by find_parent_nodes(), we free that ulist but leave a dangling pointer in
    the **roots argument. Upon receiving the error, a caller of this function
    can attempt to free the same ulist again, resulting in an invalid memory
    access.

    One such scenario is during qgroup accounting:

    btrfs_qgroup_account_extents()

    --> calls btrfs_find_all_roots() passes &new_roots (a stack allocated
    pointer) to btrfs_find_all_roots()

    --> btrfs_find_all_roots() just calls btrfs_find_all_roots_safe()
    passing &new_roots to it

    --> allocates ulist and assigns its address to **roots (which
    points to new_roots from btrfs_qgroup_account_extents())

    --> find_parent_nodes() returns an error, so we free the ulist
    and leave **roots pointing to it after returning

    --> btrfs_qgroup_account_extents() sees btrfs_find_all_roots() returned
    an error and jumps to the label 'cleanup', which just tries to
    free again the same ulist

    Stack trace example:

    ------------[ cut here ]------------
    BTRFS: tree first key check failed
    WARNING: CPU: 1 PID: 1763215 at fs/btrfs/disk-io.c:422 btrfs_verify_level_key+0xe0/0x180 [btrfs]
    Modules linked in: dm_snapshot dm_thin_pool (...)
    CPU: 1 PID: 1763215 Comm: fsstress Tainted: G W 5.8.0-rc3-btrfs-next-64 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_verify_level_key+0xe0/0x180 [btrfs]
    Code: 28 5b 5d (...)
    RSP: 0018:ffffb89b473779a0 EFLAGS: 00010286
    RAX: 0000000000000000 RBX: ffff90397759bf08 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000027 RDI: 00000000ffffffff
    RBP: ffff9039a419c000 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: ffffb89b43301000 R12: 000000000000005e
    R13: ffffb89b47377a2e R14: ffffb89b473779af R15: 0000000000000000
    FS: 00007fc47e1e1000(0000) GS:ffff9039ac200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc47e1df000 CR3: 00000003d9e4e001 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    read_block_for_search+0xf6/0x350 [btrfs]
    btrfs_next_old_leaf+0x242/0x650 [btrfs]
    resolve_indirect_refs+0x7cf/0x9e0 [btrfs]
    find_parent_nodes+0x4ea/0x12c0 [btrfs]
    btrfs_find_all_roots_safe+0xbf/0x130 [btrfs]
    btrfs_qgroup_account_extents+0x9d/0x390 [btrfs]
    btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
    btrfs_sync_file+0x3d4/0x4d0 [btrfs]
    do_fsync+0x38/0x70
    __x64_sys_fdatasync+0x13/0x20
    do_syscall_64+0x5c/0xe0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fc47e2d72e3
    Code: Bad RIP value.
    RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
    RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
    RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
    R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x755/0x1eb0
    softirqs last enabled at (0): [] copy_process+0x755/0x1eb0
    softirqs last disabled at (0): [] 0x0
    ---[ end trace 8639237550317b48 ]---
    BTRFS error (device sdc): tree first key mismatch detected, bytenr=62324736 parent_transid=94 key expected=(262,108,1351680) has=(259,108,1921024)
    general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6b6b: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    CPU: 2 PID: 1763215 Comm: fsstress Tainted: G W 5.8.0-rc3-btrfs-next-64 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:ulist_release+0x14/0x60 [btrfs]
    Code: c7 07 00 (...)
    RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
    RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
    RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
    R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
    FS: 00007fc47e1e1000(0000) GS:ffff9039ac600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f8c1c0a51c8 CR3: 00000003d9e4e004 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    ulist_free+0x13/0x20 [btrfs]
    btrfs_qgroup_account_extents+0xf3/0x390 [btrfs]
    btrfs_commit_transaction+0x4f7/0xb20 [btrfs]
    btrfs_sync_file+0x3d4/0x4d0 [btrfs]
    do_fsync+0x38/0x70
    __x64_sys_fdatasync+0x13/0x20
    do_syscall_64+0x5c/0xe0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7fc47e2d72e3
    Code: Bad RIP value.
    RSP: 002b:00007fffa32098c8 EFLAGS: 00000246 ORIG_RAX: 000000000000004b
    RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fc47e2d72e3
    RDX: 00007fffa3209830 RSI: 00007fffa3209830 RDI: 0000000000000003
    RBP: 000000000000072e R08: 0000000000000001 R09: 0000000000000003
    R10: 0000000000000000 R11: 0000000000000246 R12: 00000000000003e8
    R13: 0000000051eb851f R14: 00007fffa3209970 R15: 00005607c4ac8b50
    Modules linked in: dm_snapshot dm_thin_pool (...)
    ---[ end trace 8639237550317b49 ]---
    RIP: 0010:ulist_release+0x14/0x60 [btrfs]
    Code: c7 07 00 (...)
    RSP: 0018:ffffb89b47377d60 EFLAGS: 00010282
    RAX: 6b6b6b6b6b6b6b6b RBX: ffff903959b56b90 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000270024 RDI: ffff9036e2adc840
    RBP: ffff9036e2adc848 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000000 R12: ffff9036e2adc840
    R13: 0000000000000015 R14: ffff9039a419ccf8 R15: ffff90395d605840
    FS: 00007fc47e1e1000(0000) GS:ffff9039ad200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f6a776f7d40 CR3: 00000003d9e4e002 CR4: 00000000003606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    Fix this by making btrfs_find_all_roots_safe() set *roots to NULL after
    it frees the ulist.
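
    The pattern can be sketched in userspace C. The names below
    (struct ulist_model, model_ulist_free(), model_find_all_roots(),
    model_account_extents()) are hypothetical stand-ins, and the free
    helper is assumed to treat NULL as a no-op:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the kernel's struct ulist. */
struct ulist_model {
    int nnodes;
};

/* Counts frees so the behaviour can be checked without touching freed memory. */
static int model_free_count;

static void model_ulist_free(struct ulist_model *ul)
{
    if (!ul)
        return;             /* NULL is a no-op in this model */
    model_free_count++;
    free(ul);
}

/*
 * Models the fixed callee: on failure the list is freed and the caller's
 * pointer is reset, so a later free by the caller is harmless.
 */
static int model_find_all_roots(struct ulist_model **roots, int fail)
{
    *roots = calloc(1, sizeof(**roots));
    if (!*roots)
        return -12;         /* -ENOMEM */
    if (fail) {
        model_ulist_free(*roots);
        *roots = NULL;      /* the fix: no dangling pointer left behind */
        return -5;          /* -EIO */
    }
    return 0;
}

/* Models the caller, whose cleanup path frees new_roots unconditionally. */
static int model_account_extents(int fail)
{
    struct ulist_model *new_roots = NULL;
    int ret = model_find_all_roots(&new_roots, fail);

    model_ulist_free(new_roots);    /* the 'cleanup' label */
    return ret;
}
```

    In the failing case the list is freed exactly once; without the reset
    to NULL the cleanup path would free it a second time.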

    Fixes: 8da6d5815c592b ("Btrfs: added btrfs_find_all_roots()")
    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     
  • commit 1dae7e0e58b484eaa43d530f211098fdeeb0f404 upstream.

    [BUG]
    There are several reports of runaway balance, where balance floods the
    log with "found X extents" messages in which X never changes.

    [CAUSE]
    Commit d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after
    merge_reloc_roots") introduced BTRFS_ROOT_DEAD_RELOC_TREE bit to
    indicate that one subvolume has finished its tree blocks swap with its
    reloc tree.

    However, if balance is canceled or hits ENOSPC halfway, we don't clear
    the BTRFS_ROOT_DEAD_RELOC_TREE bit, leaving that bit set until
    unmount.

    Any subvolume root with that bit set causes the backref cache to skip
    its tree blocks, as if the root had already finished its tree block
    swap. As a result, all tree blocks of that root are ignored by balance,
    leading to runaway balance.

    [FIX]
    Fix the problem by also clearing the BTRFS_ROOT_DEAD_RELOC_TREE bit for
    the original subvolume of an orphan reloc root.

    Also add a check at umount time for a stale bit that is still set.

    Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    [Manually solve the conflicts due to no btrfs root refs rework]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • commit 51415b6c1b117e223bc083e30af675cb5c5498f3 upstream.

    [BUG]
    When balance is canceled, there is a pretty high chance that unmounting
    the fs leads to a NULL pointer dereference:

    BTRFS warning (device dm-3): page private not zero on page 223158272
    ...
    BTRFS warning (device dm-3): page private not zero on page 223162368
    BTRFS error (device dm-3): leaked root 18446744073709551608-304 refcount 1
    BUG: kernel NULL pointer dereference, address: 0000000000000168
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP NOPTI
    CPU: 2 PID: 5793 Comm: umount Tainted: G O 5.7.0-rc5-custom+ #53
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    RIP: 0010:__lock_acquire+0x5dc/0x24c0
    Call Trace:
    lock_acquire+0xab/0x390
    _raw_spin_lock+0x39/0x80
    btrfs_release_extent_buffer_pages+0xd7/0x200 [btrfs]
    release_extent_buffer+0xb2/0x170 [btrfs]
    free_extent_buffer+0x66/0xb0 [btrfs]
    btrfs_put_root+0x8e/0x130 [btrfs]
    btrfs_check_leaked_roots.cold+0x5/0x5d [btrfs]
    btrfs_free_fs_info+0xe5/0x120 [btrfs]
    btrfs_kill_super+0x1f/0x30 [btrfs]
    deactivate_locked_super+0x3b/0x80
    deactivate_super+0x3e/0x50
    cleanup_mnt+0x109/0x160
    __cleanup_mnt+0x12/0x20
    task_work_run+0x67/0xa0
    exit_to_usermode_loop+0xc5/0xd0
    syscall_return_slowpath+0x205/0x360
    do_syscall_64+0x6e/0xb0
    entry_SYSCALL_64_after_hwframe+0x49/0xb3
    RIP: 0033:0x7fd028ef740b

    [CAUSE]
    When balance is canceled, all reloc roots are marked as orphan, and
    orphan reloc roots are going to be cleaned up.

    However, for orphan reloc roots and merged reloc roots, their lifespans
    are quite different:

    Merged reloc roots                | Orphan reloc roots by cancel
    --------------------------------------------------------------------
    create_reloc_root()               | create_reloc_root()
    |- refs == 1                      | |- refs == 1
                                      |
    btrfs_grab_root(reloc_root);      | btrfs_grab_root(reloc_root);
    |- refs == 2                      | |- refs == 2
                                      |
    root->reloc_root = reloc_root;    | root->reloc_root = reloc_root;
                    >>> No difference so far <<<
                                      |
    prepare_to_merge()                | prepare_to_merge()
    |- btrfs_set_root_refs(item, 1);  | |- if (!err) (err == -EINTR)
                                      |
    merge_reloc_roots()               | merge_reloc_roots()
    |- merge_reloc_root()             | |- Doing nothing to put reloc root
       |- insert_dirty_subvol()       | |- refs == 2
          |- __del_reloc_root()       |
             |- btrfs_put_root()      |
                |- refs == 1          |
            >>> Now orphan reloc roots still have refs 2 <<<
                                      |
    clean_dirty_subvols()             | clean_dirty_subvols()
    |- btrfs_drop_snapshot()          | |- btrfs_drop_snapshot()
       |- reloc_root get freed        |    |- reloc_root still has refs 2
                                      |       related ebs get freed, but
                                      |       reloc_root still recorded in
                                      |       allocated_roots
    btrfs_check_leaked_roots()        | btrfs_check_leaked_roots()
    |- No leaked roots                | |- Leaked reloc_roots detected
                                      |    |- btrfs_put_root()
                                      |       |- free_extent_buffer(root->node);
                                      |          |- eb already freed, caused NULL
                                      |             pointer dereference

    [FIX]
    The fix is to clear fs_root->reloc_root and put it at
    merge_reloc_roots() time, so that we won't leak reloc roots.

    Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
    CC: stable@vger.kernel.org # 5.1+
    Tested-by: Johannes Thumshirn
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    [Manually solve the conflicts due to no btrfs root refs rework]
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     
  • commit 65caafd0d2145d1dd02072c4ced540624daeab40 upstream.

    Revert commit d03727b248d0 "NFSv4 fix CLOSE not waiting for
    direct IO compeletion". That patch made fput(), by calling
    inode_dio_done() in nfs_file_release(), wait uninterruptibly
    for any outstanding direct IO to the file, but that wait on IO should
    be killable.

    The other problem the patch tried to address, REMOVE returning
    ERR_ACCESS because the file is still open, is supposed to be resolved
    by the server returning ERR_FILE_OPEN and not ERR_ACCESS.

    Signed-off-by: Olga Kornievskaia
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Olga Kornievskaia
     
  • commit a5005c3cda6eeb6b95645e6cc32f58dafeffc976 upstream.

    When PageWaiters was added, updating this check was missed.

    Reported-by: Nikolaus Rath
    Reported-by: Hugh Dickins
    Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
    Signed-off-by: Miklos Szeredi
    Signed-off-by: André Almeida
    Signed-off-by: Sasha Levin

    Miklos Szeredi
     

22 Jul, 2020

11 commits

  • commit 31070f6ccec09f3bd4f1e28cd1e592fa4f3ba0b6 upstream.

    The ioctl encoding for this parameter is a long but the documentation says
    it should be an int and the kernel drivers expect it to be an int. If the
    fuse driver treats this as a long it might end up scribbling over the stack
    of a userspace process that only allocated enough space for an int.
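
    The size mismatch can be illustrated with a small userspace model. All
    names here (model_encoded_size(), model_transfer_size(),
    model_copy_out()) are hypothetical, and the model only assumes what the
    paragraph above states: the command encoding advertises a long while
    callers allocate an int.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Size advertised by the (hypothetical) command encoding. */
static size_t model_encoded_size(void)
{
    return sizeof(long);
}

/* Size that callers and drivers actually exchange. */
static size_t model_transfer_size(void)
{
    return sizeof(int);
}

/*
 * Copies a flags value into the caller's buffer using the int-sized
 * transfer, and never writes more than the caller allocated. Returns the
 * number of bytes written.
 */
static size_t model_copy_out(void *user_buf, size_t user_len, int flags)
{
    size_t n = model_transfer_size();

    if (n > user_len)
        n = user_len;
    memcpy(user_buf, &flags, n);
    return n;
}
```

    On platforms where long is wider than int, a copy sized by
    model_encoded_size() would write past an int-sized buffer; sizing it by
    model_transfer_size() leaves the bytes after the int untouched.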

    This was previously discussed in [1] and a patch for fuse was proposed in
    [2]. From what I can tell the patch in [2] was nacked in favor of adding
    new, "fixed" ioctls and using those from userspace. However there is still
    no "fixed" version of these ioctls and the fact is that it's sometimes
    infeasible to change all userspace to use the new one.

    Handling the ioctls specially in the fuse driver seems like the most
    pragmatic way for fuse servers to support them without causing crashes in
    userspace applications that call them.

    [1]: https://lore.kernel.org/linux-fsdevel/20131126200559.GH20559@hall.aurel32.net/T/
    [2]: https://sourceforge.net/p/fuse/mailman/message/31771759/

    Signed-off-by: Chirantan Ekbote
    Fixes: 59efec7b9039 ("fuse: implement ioctl support")
    Cc:
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Chirantan Ekbote
     
  • commit 0189a2d367f49729622fdafaef5da73161591859 upstream.

    s_op->remount_fs() is only called from legacy_reconfigure(), which is not
    used after being converted to the new API.

    Convert to using ->reconfigure(). This restores the previous behavior of
    syncing the filesystem and rejecting MS_MANDLOCK on remount.

    Fixes: c30da2e981a7 ("fuse: convert to use the new mount API")
    Cc: # v5.4
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit e8b20a474cf2c42698d1942f939ff2128819f151 upstream.

    The command

    mount -o remount -o unknownoption /mnt/fuse

    succeeds on kernel versions prior to v5.4 and fails on kernel versions
    at or after v5.4. This is because fuse_parse_param() rejects any
    unrecognised options in case of FS_CONTEXT_FOR_RECONFIGURE, just as for
    FS_CONTEXT_FOR_MOUNT.

    This causes a regression in case the fuse filesystem is in fstab, since
    remount sends all options found there to the kernel, even ones that are
    meant for the initial mount and are consumed by the userspace fuse server.

    Fix this by ignoring mount options, just as fuse_remount_fs() did prior to
    the conversion to the new API.

    Reported-by: Stefan Priebe
    Fixes: c30da2e981a7 ("fuse: convert to use the new mount API")
    Cc: # v5.4
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Miklos Szeredi
     
  • commit 81a33c1ee941c3bb9ffc6bac8f676be13351344e upstream.

    The check for whether the user has changed the overlay file was wrong,
    causing an unneeded call to ovl_change_flags(), including taking f_lock,
    on every file access.

    Fixes: d989903058a8 ("ovl: do not generate duplicate fsnotify events for "fake" path")
    Cc: # v4.19+
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 124c2de2c0aee96271e4ddab190083d8aa7aa71a upstream.

    Decoding a lower directory file handle to overlay path with cold
    inode/dentry cache may go as follows:

    1. Decode real lower file handle to lower dir path
    2. Check if lower dir is indexed (was copied up)
    3. If indexed, get the upper dir path from index
    4. Lookup upper dir path in overlay
    5. If overlay path found, verify that overlay lower is the lower dir
    from step 1

    On failure to verify step 5 above, the user will get an ESTALE error and
    a WARN_ON will be printed.

    A mismatch in step 5 could be the result of a lower directory that was
    renamed while the overlay was offline, after that lower directory had
    been copied up and indexed.

    This is a scripted reproducer based on xfstest overlay/052:

    # Create lower subdir
    create_dirs
    create_test_files $lower/lowertestdir/subdir
    mount_dirs
    # Copy up lower dir and encode lower subdir file handle
    touch $SCRATCH_MNT/lowertestdir
    test_file_handles $SCRATCH_MNT/lowertestdir/subdir -p -o $tmp.fhandle
    # Rename lower dir offline
    unmount_dirs
    mv $lower/lowertestdir $lower/lowertestdir.new/
    mount_dirs
    # Attempt to decode lower subdir file handle
    test_file_handles $SCRATCH_MNT -p -i $tmp.fhandle

    Since this WARN_ON() can be triggered by the user, we need to relax it.

    Fixes: 4b91c30a5a19 ("ovl: lookup connected ancestor of dir in inode cache")
    Cc: # v4.16+
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • commit 24f14009b8f1754ec2ae4c168940c01259b0f88a upstream.

    In the case where ovl_is_inuse() returns true, the trap inode reference
    is not put. Also add a comment explaining the sequence of ovl_is_inuse()
    after ovl_setup_trap().

    Fixes: 0be0bfd2de9d ("ovl: fix regression caused by overlapping layers detection")
    Cc: # v4.19+
    Reviewed-by: Amir Goldstein
    Signed-off-by: youngjun
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    youngjun
     
  • commit a888db310195400f050b89c47673f0f8babfbb41 upstream.

    Commit 9df085f3c9a2 ("ovl: relax requirement for non null uuid of lower
    fs") relaxed the requirement for non null uuid with single lower layer to
    allow enabling index and nfs_export features with single lower squashfs.

    Fabian reported a regression in a setup where overlay re-uses an existing
    upper layer and re-formats the lower squashfs image. Because squashfs
    has no uuid, the origin xattrs in the upper layer are decoded from the
    new lower layer, where they may resolve to a wrong origin file, and the
    user may get an ESTALE or EIO error on lookup.

    To avoid the reported regression while still allowing the new features
    with single lower squashfs, do not allow decoding origin with lower null
    uuid unless user opted-in to one of the new features that require
    following the lower inode of non-dir upper (index, xino, metacopy).

    Reported-by: Fabian
    Link: https://lore.kernel.org/linux-unionfs/32532923.JtPX5UtSzP@fgdesktop/
    Fixes: 9df085f3c9a2 ("ovl: relax requirement for non null uuid of lower fs")
    Cc: stable@vger.kernel.org # v4.20+
    Signed-off-by: Amir Goldstein
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Greg Kroah-Hartman

    Amir Goldstein
     
  • [ Upstream commit 7779b047a57f6824a43d0e1f70de2741b7426b9d ]

    fuse_writepages() ignores some errors taken from fuse_writepages_fill().
    I believe it is a bug: if .writepages is called with WB_SYNC_ALL it should
    either guarantee that all data was successfully saved or return an error.

    Fixes: 26d614df1da9 ("fuse: Implement writepages callback")
    Signed-off-by: Vasily Averin
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Sasha Levin

    Vasily Averin
     
  • [ Upstream commit 913fadc5b105c3619d9e8d0fe8899ff1593cc737 ]

    We used to do this before 3453d5708b33, but this was changed to better
    handle the NFS4ERR_SEQ_MISORDERED error code. This commit fixed the slot
    re-use case when the server doesn't receive the interrupted operation,
    but if the server does receive the operation then it could still end up
    replying to the client with mis-matched operations from the reply cache.

    We can fix this by sending a SEQUENCE to the server while recovering from
    a SEQ_MISORDERED error when we detect that we are in an interrupted slot
    situation.

    Fixes: 3453d5708b33 (NFSv4.1: Avoid false retries when RPC calls are interrupted)
    Signed-off-by: Anna Schumaker
    Signed-off-by: Sasha Levin

    Anna Schumaker
     
  • [ Upstream commit b780cc615ba4795a7ef0e93b19424828a5ad456a ]

    Before this patch, only read-write mounts would grab the freeze
    glock in read-only mode, as part of gfs2_make_fs_rw. So for read-only
    mounts, the freeze glock was never initialized. That meant requests
    to freeze, which request the glock in EX, were granted without any
    state transition, so you could mount a gfs2 file system that is
    currently frozen on a different cluster node in read-only mode.

    This patch makes read-only mounts lock the freeze glock in SH mode,
    which will block for file systems that are frozen on another node.

    Signed-off-by: Bob Peterson
    Signed-off-by: Sasha Levin

    Bob Peterson
     
  • [ Upstream commit 19e888678bac8c82206eb915eaf72741b2a2615c ]

    The wait_event_... defines evaluate to long, so we should not assign
    the result to an int as this may truncate the value.
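
    A small check of when that narrowing loses information, written as a
    userspace sketch; model_fits_int() is a hypothetical helper, not a
    kernel function:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Returns true when a long result (such as the value a wait_event_*
 * style macro evaluates to) can be stored in an int without changing.
 * Where long is wider than int, large values do not fit, which is why
 * the result variable should itself be a long.
 */
static bool model_fits_int(long rc)
{
    return rc >= INT_MIN && rc <= INT_MAX;
}
```

    Small values always fit; whether LONG_MAX fits depends on the platform's
    type widths, so the safe choice is to keep the variable as long.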

    Reported-by: Marshall Midden
    Signed-off-by: Ronnie Sahlberg
    Signed-off-by: Steve French
    Signed-off-by: Sasha Levin

    Ronnie Sahlberg
     

16 Jul, 2020

3 commits

  • commit 230ed397435e85b54f055c524fcb267ae2ce3bc4 upstream.

    While debugging a patch that I wrote I was hitting use-after-free panics
    when accessing block groups on unmount. This turned out to be because,
    in the nocow case, if we bail out of doing the nocow for whatever reason
    we need to call btrfs_dec_nocow_writers() if we called the inc. This
    puts our block group, but a few error cases do

    if (nocow) {
    btrfs_dec_nocow_writers();
    goto error;
    }

    unfortunately, error is

    error:
    if (nocow)
    btrfs_dec_nocow_writers();

    so we get a double put on our block group. Fix this by dropping the
    error cases calling of btrfs_dec_nocow_writers(), as it's handled at the
    error label now.
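
    The fixed control flow can be modelled in userspace C as follows.
    struct bg_model and the model_* functions are hypothetical; the point is
    only that the error label is the single place that drops the count:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a block group with a nocow writer count. */
struct bg_model {
    int nocow_writers;
};

static void model_inc_nocow_writers(struct bg_model *bg)
{
    bg->nocow_writers++;
}

static void model_dec_nocow_writers(struct bg_model *bg)
{
    bg->nocow_writers--;
}

/*
 * Fixed shape: error paths only jump to the label, and the label is the
 * one place that drops the count taken earlier. Decrementing before the
 * goto as well would drop it twice.
 */
static int model_run_nocow(struct bg_model *bg, bool fail)
{
    bool nocow = false;
    int ret = 0;

    model_inc_nocow_writers(bg);
    nocow = true;

    if (fail) {
        ret = -5;       /* -EIO */
        goto error;     /* no dec here, the label handles it */
    }

    /* Success path drops its own count and clears the flag. */
    model_dec_nocow_writers(bg);
    nocow = false;

error:
    if (nocow)
        model_dec_nocow_writers(bg);
    return ret;
}
```

    Both the failing and the successful call leave the counter balanced at
    zero.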

    Fixes: 762bf09893b4 ("btrfs: improve error handling in run_delalloc_nocow")
    CC: stable@vger.kernel.org # 5.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Josef Bacik
     
  • commit 6bf9cd2eed9aee6d742bb9296c994a91f5316949 upstream.

    Under somewhat convoluted conditions, it is possible to attempt to
    release an extent_buffer that is under io, which triggers a BUG_ON in
    btrfs_release_extent_buffer_pages.

    This relies on a few different factors. First, extent_buffer reads done
    as readahead for searching use WAIT_NONE, so they free the local extent
    buffer reference while the io is outstanding. However, they should still
    be protected by TREE_REF. But if the system is doing significant
    reclaim, and simultaneously heavily accessing the extent_buffers, it is
    possible for releasepage to race with two concurrent readahead attempts
    in a way that leaves TREE_REF unset when the readahead extent buffer is
    released.

    Essentially, if two tasks race to allocate a new extent_buffer, but the
    winner who attempts the first io is rebuffed by a page being locked
    (likely by the reclaim itself) then the loser will still go ahead with
    issuing the readahead. The loser's call to find_extent_buffer must also
    race with the reclaim task reading the extent_buffer's refcount as 1 in
    a way that allows the reclaim to re-clear the TREE_REF checked by
    find_extent_buffer.

    The following represents an example execution demonstrating the race:

    CPU0                                    CPU1                                    CPU2
    reada_for_search                        reada_for_search
      readahead_tree_block                    readahead_tree_block
        find_create_tree_block                  find_create_tree_block
          alloc_extent_buffer                     alloc_extent_buffer
                                                    find_extent_buffer // not found
                                                    allocates eb
                                                    lock pages
                                                    associate pages to eb
                                                    insert eb into radix tree
                                                    set TREE_REF, refs == 2
                                                    unlock pages
                                                read_extent_buffer_pages // WAIT_NONE
                                                  not uptodate (brand new eb)
                                                                                    lock_page
                                                  if !trylock_page
                                                    goto unlock_exit // not an error
                                                free_extent_buffer
                                                  release_extent_buffer
                                                    atomic_dec_and_test refs to 1
            find_extent_buffer // found
                                                                                    try_release_extent_buffer
                                                                                      take refs_lock
                                                                                      reads refs == 1; no io
              atomic_inc_not_zero refs to 2
              mark_buffer_accessed
                check_buffer_tree_ref
                  // not STALE, won't take refs_lock
                  refs == 2; TREE_REF set // no action
        read_extent_buffer_pages // WAIT_NONE
                                                                                      clear TREE_REF
                                                                                      release_extent_buffer
                                                                                        atomic_dec_and_test refs to 1
                                                                                        unlock_page
          still not uptodate (CPU1 read failed on trylock_page)
          locks pages
          set io_pages > 0
          submit io
          return
        free_extent_buffer
          release_extent_buffer
            dec refs to 0
            delete from radix tree
            btrfs_release_extent_buffer_pages
              BUG_ON(io_pages > 0)!!!

    We observe this at a very low rate in production and were also able to
    reproduce it in a test environment by introducing some spurious delays
    and by introducing probabilistic trylock_page failures.

    To fix it, we apply check_tree_ref at a point where it could not
    possibly be unset by a competing task: after io_pages has been
    incremented. All the codepaths that clear TREE_REF check for io, so they
    would not be able to clear it after this point until the io is done.

    Stack trace, for reference:
    [1417839.424739] ------------[ cut here ]------------
    [1417839.435328] kernel BUG at fs/btrfs/extent_io.c:4841!
    [1417839.447024] invalid opcode: 0000 [#1] SMP
    [1417839.502972] RIP: 0010:btrfs_release_extent_buffer_pages+0x20/0x1f0
    [1417839.517008] Code: ed e9 ...
    [1417839.558895] RSP: 0018:ffffc90020bcf798 EFLAGS: 00010202
    [1417839.570816] RAX: 0000000000000002 RBX: ffff888102d6def0 RCX: 0000000000000028
    [1417839.586962] RDX: 0000000000000002 RSI: ffff8887f0296482 RDI: ffff888102d6def0
    [1417839.603108] RBP: ffff88885664a000 R08: 0000000000000046 R09: 0000000000000238
    [1417839.619255] R10: 0000000000000028 R11: ffff88885664af68 R12: 0000000000000000
    [1417839.635402] R13: 0000000000000000 R14: ffff88875f573ad0 R15: ffff888797aafd90
    [1417839.651549] FS: 00007f5a844fa700(0000) GS:ffff88885f680000(0000) knlGS:0000000000000000
    [1417839.669810] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [1417839.682887] CR2: 00007f7884541fe0 CR3: 000000049f609002 CR4: 00000000003606e0
    [1417839.699037] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [1417839.715187] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [1417839.731320] Call Trace:
    [1417839.737103] release_extent_buffer+0x39/0x90
    [1417839.746913] read_block_for_search.isra.38+0x2a3/0x370
    [1417839.758645] btrfs_search_slot+0x260/0x9b0
    [1417839.768054] btrfs_lookup_file_extent+0x4a/0x70
    [1417839.778427] btrfs_get_extent+0x15f/0x830
    [1417839.787665] ? submit_extent_page+0xc4/0x1c0
    [1417839.797474] ? __do_readpage+0x299/0x7a0
    [1417839.806515] __do_readpage+0x33b/0x7a0
    [1417839.815171] ? btrfs_releasepage+0x70/0x70
    [1417839.824597] extent_readpages+0x28f/0x400
    [1417839.833836] read_pages+0x6a/0x1c0
    [1417839.841729] ? startup_64+0x2/0x30
    [1417839.849624] __do_page_cache_readahead+0x13c/0x1a0
    [1417839.860590] filemap_fault+0x6c7/0x990
    [1417839.869252] ? xas_load+0x8/0x80
    [1417839.876756] ? xas_find+0x150/0x190
    [1417839.884839] ? filemap_map_pages+0x295/0x3b0
    [1417839.894652] __do_fault+0x32/0x110
    [1417839.902540] __handle_mm_fault+0xacd/0x1000
    [1417839.912156] handle_mm_fault+0xaa/0x1c0
    [1417839.921004] __do_page_fault+0x242/0x4b0
    [1417839.930044] ? page_fault+0x8/0x30
    [1417839.937933] page_fault+0x1e/0x30
    [1417839.945631] RIP: 0033:0x33c4bae
    [1417839.952927] Code: Bad RIP value.
    [1417839.960411] RSP: 002b:00007f5a844f7350 EFLAGS: 00010206
    [1417839.972331] RAX: 000000000000006e RBX: 1614b3ff6a50398a RCX: 0000000000000000
    [1417839.988477] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000002
    [1417840.004626] RBP: 00007f5a844f7420 R08: 000000000000006e R09: 00007f5a94aeccb8
    [1417840.020784] R10: 00007f5a844f7350 R11: 0000000000000000 R12: 00007f5a94aecc79
    [1417840.036932] R13: 00007f5a94aecc78 R14: 00007f5a94aecc90 R15: 00007f5a94aecc40

    CC: stable@vger.kernel.org # 4.4+
    Reviewed-by: Filipe Manana
    Signed-off-by: Boris Burkov
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Boris Burkov
     
  • [ Upstream commit 5618303d8516f8ac5ecfe53ee8e8bc9a40eaf066 ]

    According to the truncate man page, if the size changed, the
    st_ctime and st_mtime fields should be updated. cifs does not
    do this.

    This causes xfstests generic/313 to fail.

    So add the ATTR_MTIME|ATTR_CTIME flags to attrs when changing
    the file size.
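
    The flag handling described above can be sketched in standalone C. This
    is only an illustration, not the patch itself: the helper
    size_change_valid_mask() is hypothetical, and the ATTR_* values are
    defined locally for the sketch rather than taken from kernel headers.

```c
#include <assert.h>

/* Illustrative flag bits. The names mirror the kernel's ATTR_* flags, but
 * the values are defined here only so the sketch builds on its own. */
#define ATTR_SIZE  (1u << 3)
#define ATTR_MTIME (1u << 5)
#define ATTR_CTIME (1u << 6)

/* Hypothetical helper: given the ia_valid mask of a setattr request,
 * return the mask to use so that a size change also updates mtime/ctime. */
static unsigned int size_change_valid_mask(unsigned int ia_valid)
{
	if (ia_valid & ATTR_SIZE)
		ia_valid |= ATTR_MTIME | ATTR_CTIME;
	return ia_valid;
}

/* Small self-check of the behaviour described above. */
static void size_change_self_check(void)
{
	/* A truncate request gains both timestamp flags. */
	assert(size_change_valid_mask(ATTR_SIZE) ==
	       (ATTR_SIZE | ATTR_MTIME | ATTR_CTIME));
	/* A request that does not change the size is left alone. */
	assert(size_change_valid_mask(0) == 0);
}
```

    Requests that do not carry ATTR_SIZE pass through unchanged, so only
    truncation picks up the extra timestamp updates.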

    Reported-by: Hulk Robot
    Signed-off-by: Zhang Xiaoxu
    Signed-off-by: Steve French
    Signed-off-by: Sasha Levin

    Zhang Xiaoxu
     

09 Jul, 2020

10 commits

  • commit 9ffad9263b467efd8f8dc7ae1941a0a655a2bab2 upstream.

    When running xfstests generic/035, we found that the target file was
    deleted if the rename returned -EACCES.

    In cifs_rename2, we unlink the positive target dentry if the rename
    failed with EACCES or EEXIST, even if the target dentry was already
    positive before the rename. As a result, the existing file was deleted.

    We should only delete a target file that was created during the
    rename.

    Reported-by: Hulk Robot
    Signed-off-by: Zhang Xiaoxu
    Cc: stable@vger.kernel.org
    Signed-off-by: Steve French
    Reviewed-by: Aurelien Aptel
    Signed-off-by: Greg Kroah-Hartman

    Zhang Xiaoxu
     
  • commit 6b356f6cf941d5054d7fab072cae4a5f8658e3db upstream.

    Fixes: ca567eb2b3f0 ("SMB3: Allow persistent handle timeout to be configurable on mount")
    Signed-off-by: Paul Aurich
    CC: Stable
    Signed-off-by: Steve French
    Reviewed-by: Aurelien Aptel
    Signed-off-by: Greg Kroah-Hartman

    Paul Aurich
     
  • commit ad35f169db6cd5a4c5c0a5a42fb0cad3efeccb83 upstream.

    Fixes: 3e7a02d47872 ("smb3: allow disabling requesting leases")
    Signed-off-by: Paul Aurich
    CC: Stable
    Signed-off-by: Steve French
    Reviewed-by: Aurelien Aptel
    Signed-off-by: Greg Kroah-Hartman

    Paul Aurich
     
  • commit 00dfbc2f9c61185a2e662f27c45a0bb29b2a134f upstream.

    Without this:

    - persistent handles will only be enabled for per-user tcons if the
    server advertises the 'Continuous Availability' capability
    - resilient handles would never be enabled for per-user tcons

    Signed-off-by: Paul Aurich
    CC: Stable
    Signed-off-by: Steve French
    Reviewed-by: Aurelien Aptel
    Signed-off-by: Greg Kroah-Hartman

    Paul Aurich
     
  • commit cc15461c73d7d044d56c47e869a215e49bd429c8 upstream.

    Ensure multiuser SMB3 mounts use encryption for all users' tcons if the
    mount options are configured to require encryption. Without this, only
    the primary tcon and IPC tcons are guaranteed to be encrypted. Per-user
    tcons would only be encrypted if the server was configured to require
    encryption.

    Signed-off-by: Paul Aurich
    CC: Stable
    Signed-off-by: Steve French
    Reviewed-by: Aurelien Aptel
    Signed-off-by: Greg Kroah-Hartman

    Paul Aurich
     
  • commit 22cf8419f1319ff87ec759d0ebdff4cbafaee832 upstream.

    The server is failing to apply the umask when creating new objects on
    filesystems without ACL support.

    To reproduce this, you need to use NFSv4.2 and a client and server
    recent enough to support umask, and you need to export a filesystem that
    lacks ACL support (for example, ext4 with the "noacl" mount option).

    Filesystems with ACL support are expected to take care of the umask
    themselves (usually by calling posix_acl_create).

    For filesystems without ACL support, this is up to the caller of
    vfs_create(), vfs_mknod(), or vfs_mkdir().
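
    As a rough userspace model of the rule above (not the nfsd code), the
    caller clears the umask bits from the requested mode only when the
    filesystem has no ACL support. The helper apply_umask() and its
    parameters are hypothetical names chosen for this sketch.

```c
#include <assert.h>

/* Hypothetical helper modelling what a caller of vfs_create(), vfs_mknod(),
 * or vfs_mkdir() has to do on a filesystem without ACL support: clear the
 * umask bits from the requested mode before creating the object. */
static unsigned int apply_umask(unsigned int mode, unsigned int mask,
				int fs_has_acl_support)
{
	/* With ACL support, the filesystem is expected to handle the umask. */
	if (fs_has_acl_support)
		return mode;
	return mode & ~mask;
}

/* Small self-check using common permission values. */
static void apply_umask_self_check(void)
{
	/* No ACL support: 0666 requested with umask 0022 becomes 0644. */
	assert(apply_umask(0666, 0022, 0) == 0644);
	/* ACL support: the mode is passed through for the filesystem. */
	assert(apply_umask(0666, 0022, 1) == 0666);
}
```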

    Reported-by: Elliott Mitchell
    Reported-by: Salvatore Bonaccorso
    Tested-by: Salvatore Bonaccorso
    Fixes: 47057abde515 ("nfsd: add support for the umask attribute")
    Cc: stable@vger.kernel.org
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Greg Kroah-Hartman

    J. Bruce Fields
     
  • [ Upstream commit 5391b8e1b7b7e5cfa2dd4ffdc4b8c6b64dfd1866 ]

    The flag from the primary tcon needs to be copied into the volume info
    so that cifs_get_tcon will try to enable extensions on the per-user
    tcon. At that point, since posix extensions must have already been
    enabled on the superblock, don't try to needlessly adjust the mount
    flags.

    Fixes: ce558b0e17f8 ("smb3: Add posix create context for smb3.11 posix mounts")
    Fixes: b326614ea215 ("smb3: allow "posix" mount option to enable new SMB311 protocol extensions")
    Signed-off-by: Paul Aurich
    Signed-off-by: Steve French
    Reviewed-by: Aurelien Aptel
    Signed-off-by: Sasha Levin

    Paul Aurich
     
  • [ Upstream commit bf2654017e0268cc83dc88d56f0e67ff4406631d ]

    I don't understand this code well, but I'm seeing a warning about a
    still-referenced inode on unmount, and every other similar filesystem
    does a dput() here.

    Fixes: e8a79fb14f6b ("nfsd: add nfsd/clients directory")
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Sasha Levin

    J. Bruce Fields
     
  • [ Upstream commit 681370f4b00af0fcc65bbfb9f82de526ab7ceb0a ]

    We don't drop the reference on the nfsdfs filesystem with
    mntput(nn->nfsd_mnt) until nfsd_exit_net(), but that won't be called
    until the nfsd module's unloaded, and we can't unload the module as long
    as there's a reference on nfsdfs. So this prevents module unloading.

    Fixes: 2c830dd7209b ("nfsd: persist nfsd filesystem across mounts")
    Reported-and-Tested-by: Luo Xiaogang
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Sasha Levin

    J. Bruce Fields
     
  • Track async work items that we queue, so we can safely cancel them
    if the ring is closed or the process exits. Newer kernels handle
    this automatically with io-wq, but the old workqueue based setup needs
    a bit of special help to get there.

    There's no upstream variant of this, as that would require backporting
    all the io-wq changes from 5.5 and on. Hence I made a one-off that
    ensures that we don't leak memory if we have async work items that
    need active cancelation (like socket IO).

    Reported-by: Agarwal, Anchal
    Tested-by: Agarwal, Anchal
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jens Axboe
     

01 Jul, 2020

7 commits

  • [ Upstream commit d0c7feaf87678371c2c09b3709400be416b2dc62 ]

    We recently used a fuzzer (hydra) to test XFS and automatically
    generated tmp.img (XFS v5 format, but some of the metadata is wrong).

    xfs_repair output (just one AG):
    agf_freeblks 0, counted 3224 in ag 0
    agf_longest 536874136, counted 3224 in ag 0
    sb_fdblocks 613, counted 3228

    Test as follows:
    mount tmp.img tmpdir
    cp file1M tmpdir
    sync

    In 4.19-stable, sync gets stuck. The reason is:
    xfs_mountfs
    xfs_check_summary_counts
    if ((!xfs_sb_version_haslazysbcount(&mp->m_sb) ||
    XFS_LAST_UNMOUNT_WAS_CLEAN(mp)) &&
    !xfs_fs_has_sickness(mp, XFS_SICK_FS_COUNTERS))
    return 0; -->just return, incore sb_fdblocks still be 613
    xfs_initialize_perag_data

    cp file1M tmpdir -->ok(write file to pagecache)
    sync -->stuck(write pagecache to disk)
    xfs_map_blocks
    xfs_iomap_write_allocate
    while (count_fsb != 0) {
    nimaps = 0;
    while (nimaps == 0) { --> endless loop
    nimaps = 1;
    xfs_bmapi_write(..., &nimaps) --> nimaps becomes 0 again
    xfs_bmapi_write
    xfs_bmap_alloc
    xfs_bmap_btalloc
    xfs_alloc_vextent
    xfs_alloc_fix_freelist
    xfs_alloc_space_available -->fail(agf_freeblks is 0)

    In linux-next, sync does not get stuck, because commit c2b3164320b5
    ("xfs: use the latest extent at writeback delalloc conversion time")
    removed the above while loop. dmesg shows:
    [ 55.250114] XFS (loop0): page discard on page ffffea0008bc7380, inode 0x1b0c, offset 0.

    Users do not know why this page is discarded. Better solutions are:
    1. Like xfs_repair, make sure sb_fdblocks is equal to the counted value
    (xfs_initialize_perag_data does this, but it is not called at this mount)
    2. Add AGF verification; if it fails, tell users to run repair

    This patch uses the second solution.
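
    A minimal sketch of the kind of consistency check the second solution
    implies, written as standalone C. The struct, its field names, and the
    exact conditions are assumptions made for illustration; they are not
    copied from the patch. The idea is that the free block count can never
    be smaller than the longest free extent, nor larger than the AG itself.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified, hypothetical view of the AGF counters discussed above. */
struct agf_counts {
	uint32_t length;   /* size of the allocation group, in blocks */
	uint32_t freeblks; /* total free blocks recorded in the AGF */
	uint32_t longest;  /* longest free extent recorded in the AGF */
};

/* Return 1 if the counters are mutually consistent, 0 otherwise. */
static int agf_counts_plausible(const struct agf_counts *agf)
{
	/* The longest free extent cannot exceed the total free space. */
	if (agf->freeblks < agf->longest)
		return 0;
	/* The free space cannot exceed the size of the AG. */
	if (agf->freeblks > agf->length)
		return 0;
	return 1;
}

/* Self-check with the values from the xfs_repair output above; the AG
 * length of 4096 blocks is a made-up value for the sketch. */
static void agf_self_check(void)
{
	struct agf_counts fuzzed = { 4096, 0, 536874136u };
	struct agf_counts counted = { 4096, 3224, 3224 };

	assert(!agf_counts_plausible(&fuzzed));
	assert(agf_counts_plausible(&counted));
}
```

    With a check of this shape, the fuzzed image would be rejected when the
    AGF is read, instead of leading to the writeback loop shown earlier.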

    Signed-off-by: Zheng Bin
    Signed-off-by: Ren Xudong
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Zheng Bin
     
  • commit d03727b248d0dae6199569a8d7b629a681154633 upstream.

    Figuring out the root cause of the REMOVE/CLOSE race and
    suggesting the solution was done by Neil Brown.

    Currently, direct IO calls hold a reference on the open context,
    which is dropped by an asynchronous task in
    nfs_direct_complete(). Before that reference is dropped,
    control is returned to the application, which is free to close the
    file. When the close is processed, it drops its own reference
    on the open context, but since direct IO still holds one, it doesn't
    send a CLOSE on the wire. It returns control to the application,
    which is free to do other operations. For instance, it can delete the
    file. Direct IO then finally releases its reference and triggers
    an asynchronous close, which races with the REMOVE. On the server,
    the REMOVE can be processed before the CLOSE, failing the REMOVE with
    EACCES as the file is still open.

    Signed-off-by: Olga Kornievskaia
    Suggested-by: Neil Brown
    CC: stable@vger.kernel.org
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Olga Kornievskaia
     
  • commit 8b04013737341442ed914b336cde866b902664ae upstream.

    If the mirror count changes in the new layout we pick up inside
    ff_layout_pg_init_write(), then we can end up adding the
    request to the wrong mirror and corrupting the mirror->pg_list.

    Fixes: d600ad1f2bdb ("NFS41: pop some layoutget errors to application")
    Cc: stable@vger.kernel.org
    Signed-off-by: Trond Myklebust
    Signed-off-by: Anna Schumaker
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit e5a15e17a78d58f933d17cafedfcf7486a29f5b4 upstream.

    The following kernel panic was captured when running an nfs server over
    ocfs2. At that time, ocfs2_test_inode_bit() was checking whether the
    inode located at "blkno" 5 was valid. That is the ocfs2 root inode: its
    "suballoc_slot" was OCFS2_INVALID_SLOT(65535) and it was allocated from
    //global_inode_alloc, but the code wrongly assumed that it came from a
    per-slot inode allocator, which caused an array overflow and triggered
    the kernel panic.

    BUG: unable to handle kernel paging request at 0000000000001088
    IP: [] _raw_spin_lock+0x18/0xf0
    PGD 1e06ba067 PUD 1e9e7d067 PMD 0
    Oops: 0002 [#1] SMP
    CPU: 6 PID: 24873 Comm: nfsd Not tainted 4.1.12-124.36.1.el6uek.x86_64 #2
    Hardware name: Huawei CH121 V3/IT11SGCA1, BIOS 3.87 02/02/2018
    RIP: _raw_spin_lock+0x18/0xf0
    RSP: e02b:ffff88005ae97908 EFLAGS: 00010206
    RAX: ffff88005ae98000 RBX: 0000000000001088 RCX: 0000000000000000
    RDX: 0000000000020000 RSI: 0000000000000009 RDI: 0000000000001088
    RBP: ffff88005ae97928 R08: 0000000000000000 R09: ffff880212878e00
    R10: 0000000000007ff0 R11: 0000000000000000 R12: 0000000000001088
    R13: ffff8800063c0aa8 R14: ffff8800650c27d0 R15: 000000000000ffff
    FS: 0000000000000000(0000) GS:ffff880218180000(0000) knlGS:ffff880218180000
    CS: e033 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000001088 CR3: 00000002033d0000 CR4: 0000000000042660
    Call Trace:
    igrab+0x1e/0x60
    ocfs2_get_system_file_inode+0x63/0x3a0 [ocfs2]
    ocfs2_test_inode_bit+0x328/0xa00 [ocfs2]
    ocfs2_get_parent+0xba/0x3e0 [ocfs2]
    reconnect_path+0xb5/0x300
    exportfs_decode_fh+0xf6/0x2b0
    fh_verify+0x350/0x660 [nfsd]
    nfsd4_putfh+0x4d/0x60 [nfsd]
    nfsd4_proc_compound+0x3d3/0x6f0 [nfsd]
    nfsd_dispatch+0xe0/0x290 [nfsd]
    svc_process_common+0x412/0x6a0 [sunrpc]
    svc_process+0x123/0x210 [sunrpc]
    nfsd+0xff/0x170 [nfsd]
    kthread+0xcb/0xf0
    ret_from_fork+0x61/0x90
    Code: 83 c2 02 0f b7 f2 e8 18 dc 91 ff 66 90 eb bf 0f 1f 40 00 55 48 89 e5 41 56 41 55 41 54 53 0f 1f 44 00 00 48 89 fb ba 00 00 02 00 0f c1 17 89 d0 45 31 e4 45 31 ed c1 e8 10 66 39 d0 41 89 c6
    RIP _raw_spin_lock+0x18/0xf0
    CR2: 0000000000001088
    ---[ end trace 7264463cd1aac8f9 ]---
    Kernel panic - not syncing: Fatal exception

    Link: http://lkml.kernel.org/r/20200616183829.87211-4-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Joel Becker
    Cc: Jun Piao
    Cc: Mark Fasheh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Junxiao Bi
     
  • commit 9277f8334ffc719fe922d776444d6e4e884dbf30 upstream.

    In the ocfs2 disk layout, the slot number is 16 bits, but in the ocfs2
    implementation, the slot number is 32 bits. Usually this does not cause
    any issue, because the slot number is converted from u16 to u32.
    However, OCFS2_INVALID_SLOT was defined as -1. When an invalid slot
    number was read from disk, its value was (u16)-1, which was then
    converted to u32 and no longer compared equal to OCFS2_INVALID_SLOT.
    As a result, the following check in get_local_system_inode was always
    skipped:

    static struct inode **get_local_system_inode(struct ocfs2_super *osb,
                                                 int type,
                                                 u32 slot)
    {
            BUG_ON(slot == OCFS2_INVALID_SLOT);
            ...
    }
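
    The width mismatch can be reproduced in standalone C. In this sketch,
    the macro and function names are hypothetical and stand in for the
    kernel definitions: a 16-bit on-disk value of all ones widens to 65535,
    which is not equal to -1 once both sides are 32-bit, while a 16-bit
    definition of the invalid value compares equal.

```c
#include <assert.h>
#include <stdint.h>

/* Two candidate definitions of the "invalid slot" value (hypothetical
 * names): the int -1 described above, and a 16-bit alternative. */
#define INVALID_SLOT_AS_INT (-1)
#define INVALID_SLOT_AS_U16 ((uint16_t)-1)

/* Model reading a 16-bit on-disk slot and widening it to 32 bits. */
static uint32_t slot_from_disk(uint16_t on_disk_slot)
{
	return on_disk_slot; /* zero-extended: 0xffff becomes 0x0000ffff */
}

/* Comparison against -1: the constant converts to 0xffffffff. */
static int matches_int_definition(uint32_t slot)
{
	return slot == (uint32_t)INVALID_SLOT_AS_INT;
}

/* Comparison against (u16)-1: the constant is 65535. */
static int matches_u16_definition(uint32_t slot)
{
	return slot == INVALID_SLOT_AS_U16;
}

/* Self-check: the invalid on-disk value is missed by the first
 * comparison and caught by the second. */
static void invalid_slot_self_check(void)
{
	uint32_t slot = slot_from_disk((uint16_t)-1);

	assert(slot == 65535u);
	assert(!matches_int_definition(slot));
	assert(matches_u16_definition(slot));
}
```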

    Link: http://lkml.kernel.org/r/20200616183829.87211-5-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Jun Piao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Junxiao Bi
     
  • commit 7569d3c754e452769a5747eeeba488179e38a5da upstream.

    Set global_inode_alloc as OCFS2_FIRST_ONLINE_SYSTEM_INODE, so that it
    is loaded during mount. It can then be used to test whether some
    global/system inodes are valid. One use case is that nfsd will test
    whether the root inode is valid.

    Link: http://lkml.kernel.org/r/20200616183829.87211-3-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Joel Becker
    Cc: Jun Piao
    Cc: Mark Fasheh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Junxiao Bi
     
  • commit 4cd9973f9ff69e37dd0ba2bd6e6423f8179c329a upstream.

    Patch series "ocfs2: fix nfsd over ocfs2 issues", v2.

    This is a series of patches to fix issues with nfsd over ocfs2. Patch 1
    avoids an inode being removed while nfsd accesses it. Patches 2 & 3 fix
    a panic issue.

    This patch (of 4):

    When nfsd gets a file dentry from a handle, or the parent dentry of
    some dentry, a cluster lock is used to keep the inode from being removed
    by another node. The inode could still be removed on the local node,
    so use a rw lock to prevent this.

    Link: http://lkml.kernel.org/r/20200616183829.87211-1-junxiao.bi@oracle.com
    Link: http://lkml.kernel.org/r/20200616183829.87211-2-junxiao.bi@oracle.com
    Signed-off-by: Junxiao Bi
    Reviewed-by: Joseph Qi
    Cc: Changwei Ge
    Cc: Gang He
    Cc: Joel Becker
    Cc: Jun Piao
    Cc: Mark Fasheh
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Junxiao Bi