07 Oct, 2020

8 commits

  • commit 3701cb59d892b88d569427586f01491552f377b1 upstream.

    or get freed, for that matter, if it's a long (separately stored)
    name.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit fe0a916c1eae8e17e86c3753d13919177d63ed7e upstream.

    Checking for the lack of epitems referring to the epoll we want to insert into
    is not enough; we might have an insertion of that epoll into another one that
    has already collected the set of files to recheck for excessive reverse paths,
    but hasn't gotten to creating/inserting the epitem for it.

    However, any such insertion in progress can be detected - it will update the
    generation count in our epoll when it's done looking through it for files
    to check. That gets done under ->mtx of our epoll and that allows us to
    detect that safely.

    We are *not* holding epmutex here, so the generation count is not stable.
    However, since both the update of ep->gen by loop check and (later)
    insertion into ->f_ep_link are done with ep->mtx held, we are fine -
    the sequence is
            grab epmutex
            bump loop_check_gen
            ...
            grab tep->mtx           // 1
            tep->gen = loop_check_gen
            ...
            drop tep->mtx           // 2
            ...
            grab tep->mtx           // 3
            ...
            insert into ->f_ep_link
            ...
            drop tep->mtx           // 4
            bump loop_check_gen
            drop epmutex
    and if the fastpath check in another thread happens for that
    eventpoll, it can come
            * before (1) - in that case fastpath is just fine
            * after (4) - we'll see non-empty ->f_ep_link, slow path
              taken
            * between (2) and (3) - loop_check_gen is stable,
              with ->mtx providing barriers and we end up taking slow path.

    Note that ->f_ep_link emptiness check is slightly racy - we are protected
    against insertions into that list, but removals can happen right under us.
    Not a problem - in the worst case we'll end up taking a slow path for
    no good reason.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit 18306c404abe18a0972587a6266830583c60c928 upstream.

    removes the need to clear it, along with the races.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit f8d4f44df056c5b504b0d49683fb7279218fd207 upstream.

    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • [ Upstream commit d33030e2ee3508d65db5644551435310df86010e ]

    nfs_readdir_page_filler() iterates over entries in a directory, reusing
    the same security label buffer, but does not reset the buffer's length.
    This causes decode_attr_security_label() to return -ERANGE if an entry's
    security label is longer than the previous one's. This error, in
    nfs4_decode_dirent(), only gets passed up as -EAGAIN, which causes another
    failed attempt to copy into the buffer. The second error is ignored and
    the remaining entries do not show up in ls, specifically the getdents64()
    syscall.

    Reproduce by creating multiple files in NFS and giving one of the later
    files a longer security label. ls will not see that file nor any that are
    added afterwards, though they will exist on the backend.

    In nfs_readdir_page_filler(), reset security label buffer length before
    every reuse.
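
    The failure mode described above can be modelled in userspace. The
    sketch below is not the NFS client code: LabelBuf, decode_label(),
    fill_entries() and the 2048-byte capacity are hypothetical names and
    values chosen for illustration. It only shows why a shared buffer whose
    length field doubles as "capacity in, size out" must be reset before
    each reuse.

```python
# Userspace model only; names are hypothetical, not the NFS client's real API.
ERANGE = 34
LABEL_CAPACITY = 2048  # assumed buffer capacity, chosen for illustration


class LabelBuf:
    """Reusable label buffer: `length` is capacity on input, label size on output."""

    def __init__(self):
        self.data = b""
        self.length = LABEL_CAPACITY


def decode_label(buf, wire_label):
    """Copy a label into buf, failing if it exceeds the advertised length."""
    if len(wire_label) > buf.length:
        return -ERANGE
    buf.data = wire_label
    buf.length = len(wire_label)  # output: actual label size
    return 0


def fill_entries(labels, reset_length):
    """Decode each entry's label with one shared buffer; return labels seen."""
    buf = LabelBuf()
    seen = []
    for wire_label in labels:
        if reset_length:
            buf.length = LABEL_CAPACITY  # the fix: restore capacity before reuse
        if decode_label(buf, wire_label) != 0:
            break  # remaining entries are lost, as in the bug report
        seen.append(buf.data)
    return seen


labels = [b"short", b"a_much_longer_label", b"tiny"]
print(len(fill_entries(labels, reset_length=False)))  # 1
print(len(fill_entries(labels, reset_length=True)))   # 3
```

    Without the reset, the second (longer) label fails and every later
    entry disappears from the listing; with the reset, all three decode.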

    Signed-off-by: Jeffrey Mitchell
    Fixes: b4487b935452 ("nfs: Fix getxattr kernel panic and memory overflow")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Jeffrey Mitchell
     
  • [ Upstream commit 933a3752babcf6513117d5773d2b70782d6ad149 ]

    the callers rely upon having any iov_iter_truncate() done inside
    ->direct_IO() countered by iov_iter_reexpand().

    Reported-by: Qian Cai
    Tested-by: Qian Cai
    Signed-off-by: Al Viro
    Signed-off-by: Sasha Levin

    Al Viro
     
  • A bug existed in the XFS reflink code between v5.1 and v5.5 in which
    the mapping for a COW IO was not trimmed to the mapping of the COW
    extent that was found. This resulted in a too-short copy, and
    corruption of other files which shared the original extent.

    (This happened only when extent size hints were set, which bypasses
    delalloc and led to this code path.)

    This was (inadvertently) fixed upstream with

    36adcbace24e "xfs: fill out the srcmap in iomap_begin"

    and related patches which moved lots of this functionality to
    the iomap subsystem.

    Hence, this is a -stable only patch, targeted to fix this
    corruption vector without other major code changes.

    Fixes: 78f0cc9d55cb ("xfs: don't use delalloc extents for COW on files with extsize hints")
    Cc: # 5.4.x
    Signed-off-by: Eric Sandeen
    Reviewed-by: Darrick J. Wong
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Sasha Levin

    Eric Sandeen
     
  • commit 4c8f353272dd1262013873990c0fafd0e3c8f274 upstream.

    We use a device's allocation state tree to track ranges in a device used
    for allocated chunks, and we set ranges in this tree when allocating a new
    chunk. However after a device replace operation, we were not setting the
    allocated ranges in the new device's allocation state tree, so that tree
    is empty after a device replace.

    This means that a fitrim operation after a device replace will trim the
    device ranges that have allocated chunks and extents, as we trim every
    range for which there is not a range marked in the device's allocation
    state tree. It is also important during chunk allocation, since the
    device's allocation state is used to determine if a range is already
    allocated when allocating a new chunk.
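
    To illustrate why an empty allocation state tree is dangerous for trim,
    here is a minimal userspace sketch. trimmable_ranges() and the sizes
    used are assumptions made for the example, not btrfs code: it simply
    returns every range that is not marked as allocated, which is the set
    of ranges a trim would discard.

```python
MIB = 1024 * 1024


def trimmable_ranges(device_size, allocated):
    """Return [start, end) ranges not covered by any allocated range."""
    free = []
    cursor = 0
    for start, end in sorted(allocated):
        if start > cursor:
            free.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < device_size:
        free.append((cursor, device_size))
    return free


# Two hypothetical chunks on a 32 MiB device.
chunks = [(1 * MIB, 9 * MIB), (9 * MIB, 17 * MIB)]
device_size = 32 * MIB

populated = trimmable_ranges(device_size, chunks)  # tree tracks the chunks
empty = trimmable_ranges(device_size, [])          # tree left empty after replace

print(populated)
print(empty)
```

    With the tree populated, only the gaps around the chunks are trimmed.
    With the tree empty, the single trimmable range covers the whole
    device, including the chunks that hold live data.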

    This is trivial to reproduce and the following script triggers the bug:

    $ cat reproducer.sh
    #!/bin/bash

    DEV1="/dev/sdg"
    DEV2="/dev/sdh"
    DEV3="/dev/sdi"

    wipefs -a $DEV1 $DEV2 $DEV3 &> /dev/null

    # Create a raid1 test fs on 2 devices.
    mkfs.btrfs -f -m raid1 -d raid1 $DEV1 $DEV2 > /dev/null
    mount $DEV1 /mnt/btrfs

    xfs_io -f -c "pwrite -S 0xab 0 10M" /mnt/btrfs/foo

    echo "Starting to replace $DEV1 with $DEV3"
    btrfs replace start -B $DEV1 $DEV3 /mnt/btrfs
    echo

    echo "Running fstrim"
    fstrim /mnt/btrfs
    echo

    echo "Unmounting filesystem"
    umount /mnt/btrfs

    echo "Mounting filesystem in degraded mode using $DEV3 only"
    wipefs -a $DEV1 $DEV2 &> /dev/null
    mount -o degraded $DEV3 /mnt/btrfs
    if [ $? -ne 0 ]; then
            dmesg | tail
            echo
            echo "Failed to mount in degraded mode"
            exit 1
    fi

    echo
    echo "File foo data (expected all bytes = 0xab):"
    od -A d -t x1 /mnt/btrfs/foo

    umount /mnt/btrfs

    When running the reproducer:

    $ ./reproducer.sh
    wrote 10485760/10485760 bytes at offset 0
    10 MiB, 2560 ops; 0.0901 sec (110.877 MiB/sec and 28384.5216 ops/sec)
    Starting to replace /dev/sdg with /dev/sdi

    Running fstrim

    Unmounting filesystem
    Mounting filesystem in degraded mode using /dev/sdi only
    mount: /mnt/btrfs: wrong fs type, bad option, bad superblock on /dev/sdi, missing codepage or helper program, or other error.
    [19581.748641] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi started
    [19581.803842] BTRFS info (device sdg): dev_replace from /dev/sdg (devid 1) to /dev/sdi finished
    [19582.208293] BTRFS info (device sdi): allowing degraded mounts
    [19582.208298] BTRFS info (device sdi): disk space caching is enabled
    [19582.208301] BTRFS info (device sdi): has skinny extents
    [19582.212853] BTRFS warning (device sdi): devid 2 uuid 1f731f47-e1bb-4f00-bfbb-9e5a0cb4ba9f is missing
    [19582.213904] btree_readpage_end_io_hook: 25839 callbacks suppressed
    [19582.213907] BTRFS error (device sdi): bad tree block start, want 30490624 have 0
    [19582.214780] BTRFS warning (device sdi): failed to read root (objectid=7): -5
    [19582.231576] BTRFS error (device sdi): open_ctree failed

    Failed to mount in degraded mode

    So fix by setting all allocated ranges in the replace target device when
    the replace operation is finishing, when we are holding the chunk mutex
    and we can not race with new chunk allocations.

    A test case for fstests follows soon.

    Fixes: 1c11b63eff2a67 ("btrfs: replace pending/pinned chunks lists with io tree")
    CC: stable@vger.kernel.org # 5.2+
    Reviewed-by: Nikolay Borisov
    Reviewed-by: Johannes Thumshirn
    Signed-off-by: Filipe Manana
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Filipe Manana
     

01 Oct, 2020

32 commits

  • commit 35be8851d172c6e3db836c0f28c19087b10c9e00 upstream.

    Syzkaller reported a buffer overflow in btree_readpage_end_io_hook()
    when loop mounting a crafted image:

    detected buffer overflow in memcpy
    ------------[ cut here ]------------
    kernel BUG at lib/string.c:1129!
    invalid opcode: 0000 [#1] PREEMPT SMP KASAN
    CPU: 1 PID: 26 Comm: kworker/u4:2 Not tainted 5.9.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Workqueue: btrfs-endio-meta btrfs_work_helper
    RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129
    RSP: 0018:ffffc90000e27980 EFLAGS: 00010286
    RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000
    RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22
    RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7
    R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e
    R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c
    FS: 0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000557335f440d0 CR3: 000000009647d000 CR4: 00000000001506e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    memcpy include/linux/string.h:405 [inline]
    btree_readpage_end_io_hook.cold+0x206/0x221 fs/btrfs/disk-io.c:642
    end_bio_extent_readpage+0x4de/0x10c0 fs/btrfs/extent_io.c:2854
    bio_endio+0x3cf/0x7f0 block/bio.c:1449
    end_workqueue_fn+0x114/0x170 fs/btrfs/disk-io.c:1695
    btrfs_work_helper+0x221/0xe20 fs/btrfs/async-thread.c:318
    process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
    worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
    kthread+0x3b5/0x4a0 kernel/kthread.c:292
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294
    Modules linked in:
    ---[ end trace b68924293169feef ]---
    RIP: 0010:fortify_panic+0xf/0x20 lib/string.c:1129
    RSP: 0018:ffffc90000e27980 EFLAGS: 00010286
    RAX: 0000000000000022 RBX: ffff8880a80dca64 RCX: 0000000000000000
    RDX: ffff8880a90860c0 RSI: ffffffff815dba07 RDI: fffff520001c4f22
    RBP: ffff8880a80dca00 R08: 0000000000000022 R09: ffff8880ae7318e7
    R10: 0000000000000000 R11: 0000000000077578 R12: 00000000ffffff6e
    R13: 0000000000000008 R14: ffffc90000e27a40 R15: 1ffff920001c4f3c
    FS: 0000000000000000(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f95b7c4d008 CR3: 000000009647d000 CR4: 00000000001506e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

    The overflow happens, because in btree_readpage_end_io_hook() we assume
    that we have found a 4 byte checksum instead of the real possible 32
    bytes we have for the checksums.

    With the fix applied:

    [ 35.726623] BTRFS: device fsid 815caf9a-dc43-4d2a-ac54-764b8333d765 devid 1 transid 5 /dev/loop0 scanned by syz-repro (215)
    [ 35.738994] BTRFS info (device loop0): disk space caching is enabled
    [ 35.738998] BTRFS info (device loop0): has skinny extents
    [ 35.743337] BTRFS warning (device loop0): loop0 checksum verify failed on 1052672 wanted 0xf9c035fc8d239a54 found 0x67a25c14b7eabcf9 level 0
    [ 35.743420] BTRFS error (device loop0): failed to read chunk root
    [ 35.745899] BTRFS error (device loop0): open_ctree failed

    Reported-by: syzbot+e864a35d361e1d4e29a5@syzkaller.appspotmail.com
    Fixes: d5178578bcd4 ("btrfs: directly call into crypto framework for checksumming")
    CC: stable@vger.kernel.org # 5.4+
    Signed-off-by: Johannes Thumshirn
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Johannes Thumshirn
     
  • [ Upstream commit fa91e4aa1716004ea8096d5185ec0451e206aea0 ]

    [BUG]
    When running tests like generic/013 on a test device with btrfs quota
    enabled, it often leads to a qgroup reserved space leak, detected at
    unmount time:

    BTRFS warning (device dm-3): qgroup 0/5 has unreleased space, type 0 rsv 4096
    ------------[ cut here ]------------
    WARNING: CPU: 11 PID: 16386 at fs/btrfs/disk-io.c:4142 close_ctree+0x1dc/0x323 [btrfs]
    RIP: 0010:close_ctree+0x1dc/0x323 [btrfs]
    Call Trace:
    btrfs_put_super+0x15/0x17 [btrfs]
    generic_shutdown_super+0x72/0x110
    kill_anon_super+0x18/0x30
    btrfs_kill_super+0x17/0x30 [btrfs]
    deactivate_locked_super+0x3b/0xa0
    deactivate_super+0x40/0x50
    cleanup_mnt+0x135/0x190
    __cleanup_mnt+0x12/0x20
    task_work_run+0x64/0xb0
    __prepare_exit_to_usermode+0x1bc/0x1c0
    __syscall_return_slowpath+0x47/0x230
    do_syscall_64+0x64/0xb0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    ---[ end trace caf08beafeca2392 ]---
    BTRFS error (device dm-3): qgroup reserved space leaked

    [CAUSE]
    In the offending case, the offending operations are:
    2/6: writev f2X[269 1 0 0 0 0] [1006997,67,288] 0
    2/7: truncate f2X[269 1 0 0 48 1026293] 18388 0

    The following sequence of events could happen after the writev():
    CPU1 (writeback)                        | CPU2 (truncate)
    -----------------------------------------------------------------
    btrfs_writepages()                      |
    |- extent_write_cache_pages()           |
       |- Got page for 1003520              |
       |  1003520 is Dirty, no writeback    |
       |  So (!clear_page_dirty_for_io())   |
       |  gets called for it                |
       |- Now page 1003520 is Clean.        |
       |                                    | btrfs_setattr()
       |                                    | |- btrfs_setsize()
       |                                    |    |- truncate_setsize()
       |                                    |       New i_size is 18388
       |- __extent_writepage()              |
       |  |- page_offset() > i_size         |
          |- btrfs_invalidatepage()         |
             |- Page is clean, so no qgroup |
                callback executed

    This means, the qgroup reserved data space is not properly released in
    btrfs_invalidatepage() as the page is Clean.

    [FIX]
    Instead of checking the dirty bit of a page, call
    btrfs_qgroup_free_data() unconditionally in btrfs_invalidatepage().

    Since qgroup reservations are bound to the QGROUP_RESERVED bit of the
    io_tree rather than to page status, this cannot cause a double free.
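
    The reason an unconditional free is safe can be shown with a small
    userspace model. IoTree, reserve() and free_data() below are
    hypothetical stand-ins, not the btrfs io_tree API: the point is only
    that when the release is keyed on a per-range reserved bit, a second
    call over the same range finds nothing left to release.

```python
class IoTree:
    """Tracks page-aligned offsets that carry a (hypothetical) RESERVED bit."""

    PAGE = 4096

    def __init__(self):
        self.reserved = set()

    def reserve(self, start, length):
        """Set the reserved bit on every page in [start, start + length)."""
        for off in range(start, start + length, self.PAGE):
            self.reserved.add(off)

    def free_data(self, start, length):
        """Clear the bit over the range; return the bytes actually released."""
        freed = 0
        for off in range(start, start + length, self.PAGE):
            if off in self.reserved:
                self.reserved.remove(off)
                freed += self.PAGE
        return freed


tree = IoTree()
tree.reserve(1003520, 4096)            # page offset taken from the trace above
first = tree.free_data(1003520, 4096)  # releases the reservation
second = tree.free_data(1003520, 4096) # nothing left, so nothing is freed twice
print(first, second)  # 4096 0
```

    Whether the page is dirty or clean never enters the decision, which
    is why the clean-page case from the race above is also covered.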

    Fixes: 0b34c261e235 ("btrfs: qgroup: Prevent qgroup->reserved from going subzero")
    CC: stable@vger.kernel.org # 4.14+
    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Qu Wenruo
     
  • [ Upstream commit 95a3d8f3af9b0d63b43f221b630beaab9739d13a ]

    When running xfstests generic/451, there is a BUG at mm/memcontrol.c:
    page:ffffea000560f2c0 refcount:2 mapcount:0 mapping:000000008544e0ea
    index:0xf
    mapping->aops:cifs_addr_ops dentry name:"tst-aio-dio-cycle-write.451"
    flags: 0x2fffff80000001(locked)
    raw: 002fffff80000001 ffffc90002023c50 ffffea0005280088 ffff88815cda0210
    raw: 000000000000000f 0000000000000000 00000002ffffffff ffff88817287d000
    page dumped because: VM_BUG_ON_PAGE(page->mem_cgroup)
    page->mem_cgroup:ffff88817287d000
    ------------[ cut here ]------------
    kernel BUG at mm/memcontrol.c:2659!
    invalid opcode: 0000 [#1] SMP
    CPU: 2 PID: 2038 Comm: xfs_io Not tainted 5.8.0-rc1 #44
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.4
    RIP: 0010:commit_charge+0x35/0x50
    Code: 0d 48 83 05 54 b2 02 05 01 48 89 77 38 c3 48 c7 c6 78 4a ea ba 48 83 05 38 b2 02 05 01 e8 63 0d9
    RSP: 0018:ffffc90002023a50 EFLAGS: 00010202
    RAX: 0000000000000000 RBX: ffff88817287d000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff88817ac97ea0 RDI: ffff88817ac97ea0
    RBP: ffffea000560f2c0 R08: 0000000000000203 R09: 0000000000000005
    R10: 0000000000000030 R11: ffffc900020237a8 R12: 0000000000000000
    R13: 0000000000000001 R14: 0000000000000001 R15: ffff88815a1272c0
    FS: 00007f5071ab0800(0000) GS:ffff88817ac80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055efcd5ca000 CR3: 000000015d312000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    mem_cgroup_charge+0x166/0x4f0
    __add_to_page_cache_locked+0x4a9/0x710
    add_to_page_cache_locked+0x15/0x20
    cifs_readpages+0x217/0x1270
    read_pages+0x29a/0x670
    page_cache_readahead_unbounded+0x24f/0x390
    __do_page_cache_readahead+0x3f/0x60
    ondemand_readahead+0x1f1/0x470
    page_cache_async_readahead+0x14c/0x170
    generic_file_buffered_read+0x5df/0x1100
    generic_file_read_iter+0x10c/0x1d0
    cifs_strict_readv+0x139/0x170
    new_sync_read+0x164/0x250
    __vfs_read+0x39/0x60
    vfs_read+0xb5/0x1e0
    ksys_pread64+0x85/0xf0
    __x64_sys_pread64+0x22/0x30
    do_syscall_64+0x69/0x150
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f5071fcb1af
    Code: Bad RIP value.
    RSP: 002b:00007ffde2cdb8e0 EFLAGS: 00000293 ORIG_RAX: 0000000000000011
    RAX: ffffffffffffffda RBX: 00007ffde2cdb990 RCX: 00007f5071fcb1af
    RDX: 0000000000001000 RSI: 000055efcd5ca000 RDI: 0000000000000003
    RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000001000 R11: 0000000000000293 R12: 0000000000000001
    R13: 000000000009f000 R14: 0000000000000000 R15: 0000000000001000
    Modules linked in:
    ---[ end trace 725fa14a3e1af65c ]---

    Since commit 3fea5a499d57 ("mm: memcontrol: convert page cache to a new
    mem_cgroup_charge() API"), the page charge is no longer cancelled, so a
    page may be added to the page cache twice:
    thread1                                    | thread2
    cifs_readpages                             |
    readpages_get_pages                        |
    add_to_page_cache_locked(head,index=n)=0   |
                                               | readpages_get_pages
                                               | add_to_page_cache_locked(head,index=n+1)=0
    add_to_page_cache_locked(head, index=n+1)=-EEXIST
    then, the next loop runs with the list head page's
    index=n+1 and page->mapping not NULL
    readpages_get_pages
    add_to_page_cache_locked(head, index=n+1)
    commit_charge
    VM_BUG_ON_PAGE

    So we should not continue to the next loop iteration when adding any
    page to the page cache fails.

    Reported-by: Hulk Robot
    Signed-off-by: Zhang Xiaoxu
    Signed-off-by: Steve French
    Acked-by: Ronnie Sahlberg
    Signed-off-by: Sasha Levin

    Zhang Xiaoxu
     
  • [ Upstream commit 5be5945864ea143fda628e8179c8474457af1f43 ]

    When sunrpc trace points are not enabled, the recorded task ID
    information alone is not helpful.

    Signed-off-by: Chuck Lever
    Signed-off-by: Anna Schumaker
    Signed-off-by: Sasha Levin

    Chuck Lever
     
  • [ Upstream commit dc3da0461cc4b76f2d0c5b12247fcb3b520edbbf ]

    Nothing ensures that session will still be valid by the time we
    dereference the pointer. Take and put a reference.

    In principle, we should always be able to get a reference here, but
    throw a warning if that's ever not the case.
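
    The take-and-put pattern described above, sketched as a userspace
    model. Session, get_session(), put_session() and use_session() are
    hypothetical names for illustration and do not mirror the ceph client
    API: the reference is taken before the pointer is used, a warning is
    emitted if it cannot be taken, and the reference is always dropped
    afterwards.

```python
import warnings


class Session:
    """Minimal refcounted object; refcount 0 means it is being torn down."""

    def __init__(self):
        self.refcount = 1


def get_session(session):
    """Take a reference unless the session is missing or already dead."""
    if session is None or session.refcount == 0:
        return None
    session.refcount += 1
    return session


def put_session(session):
    """Drop a reference taken with get_session()."""
    session.refcount -= 1


def use_session(session):
    """Dereference the session only while holding our own reference."""
    held = get_session(session)
    if held is None:
        warnings.warn("could not get a session reference")
        return False
    try:
        return True  # the real work on the session would happen here
    finally:
        put_session(held)


live = Session()
print(use_session(live), live.refcount)  # True 1
```

    The refcount returns to its starting value after use, so the caller's
    temporary reference does not leak.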

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Sasha Levin

    Jeff Layton
     
  • [ Upstream commit c36cac28cb94e58f7e21ff43bdc6064346dab32c ]

    In btrfs_submit_direct(), if we fail to allocate the btrfs_dio_private,
    we complete the ordered extent range. However, we don't mark that the
    range doesn't need to be cleaned up from btrfs_direct_IO() until later.
    Therefore, if we fail to allocate the btrfs_dio_private, we complete the
    ordered extent range twice. We could fix this by updating
    unsubmitted_oe_range earlier, but it's cleaner to reorganize the code so
    that creating the btrfs_dio_private and submitting the bios are
    separate, and once the btrfs_dio_private is created, cleanup always
    happens through the btrfs_dio_private.

    The logic around unsubmitted_oe_range_end and unsubmitted_oe_range_start
    is really subtle. We have the following:

    1. btrfs_direct_IO sets those two to the same value.

    2. When we call __blockdev_direct_IO, unless
    btrfs_get_blocks_direct->btrfs_get_blocks_direct_write is called to
    modify unsubmitted_oe_range_start so that start < end, cleanup
    won't happen.

    3. We come into btrfs_submit_direct - if the dip allocation fails we'd
    return with oe_range_end now modified so cleanup will happen.

    4. If we manage to allocate the dip we reset the unsubmitted range
    members to be equal so that cleanup happens from
    btrfs_endio_direct_write.

    This 4-step logic is not really obvious, especially given it's scattered
    across 3 functions.

    Fixes: f28a49287817 ("Btrfs: fix leaking of ordered extents after direct IO write error")
    Reviewed-by: Johannes Thumshirn
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Omar Sandoval
    [ add range start/end logic explanation from Nikolay ]
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Omar Sandoval
     
  • [ Upstream commit 7c09c03091ac562ddca2b393e5d65c1d37da79f1 ]

    Deleting a subvolume on a full filesystem leads to ENOSPC followed by a
    forced read-only. This is not a transaction abort and the filesystem is
    otherwise ok, so the error should be just propagated to the callers.

    This is caused by an unnecessary call to btrfs_handle_fs_error for all
    errors, except EAGAIN. This does not make sense as the standard
    transaction abort mechanism is in btrfs_drop_snapshot so all relevant
    failures are handled.

    Originally in commit cb1b69f4508a ("Btrfs: forced readonly when
    btrfs_drop_snapshot() fails") there was no return value at all, so the
    btrfs_std_error made some sense but once the error handling and
    propagation has been implemented we don't need it anymore.

    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    David Sterba
     
  • [ Upstream commit 5ddd9ced9aef6cfa76af27d384c17c9e2d610ce8 ]

    A GETATTR request can race with FUSE_NOTIFY_INVAL_INODE, resulting in the
    attribute cache being updated with stale information after the
    invalidation.

    Fix this by bumping the attribute version in fuse_reverse_inval_inode().
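
    A minimal userspace model of the race and of why a version bump closes
    it. Inode, start_getattr(), finish_getattr() and reverse_inval() are
    hypothetical names, and the single integer counter is a simplification
    assumed for the example rather than the FUSE implementation: a reply is
    applied only if no attribute change was recorded after the request
    started.

```python
class Inode:
    def __init__(self):
        self.attr_version = 0
        self.cached_size = None


def start_getattr(inode):
    """Record the attribute version before the request is sent."""
    return inode.attr_version


def finish_getattr(inode, version_at_start, size):
    """Apply the reply only if no newer attribute change happened meanwhile."""
    if inode.attr_version > version_at_start:
        return False  # stale reply, leave the cache alone
    inode.cached_size = size
    return True


def reverse_inval(inode, bump_version):
    """Drop cached attributes; with the fix, also bump the version."""
    inode.cached_size = None
    if bump_version:
        inode.attr_version += 1


def race(bump_version):
    """GETATTR starts, invalidation arrives, then the old reply lands."""
    inode = Inode()
    version = start_getattr(inode)
    reverse_inval(inode, bump_version)
    finish_getattr(inode, version, size=100)  # 100 is the pre-change size
    return inode.cached_size


print(race(bump_version=False))  # 100
print(race(bump_version=True))   # None
```

    Without the bump, the late reply repopulates the cache with the old
    size; with the bump, the reply is recognised as stale and discarded.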

    Reported-by: Krzysztof Rusek
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Sasha Levin

    Miklos Szeredi
     
  • [ Upstream commit 32f98877c57bee6bc27f443a96f49678a2cd6a50 ]

    page_count() is unstable. Unless there has been an RCU grace period
    between when the page was removed from the page cache and now, a
    speculative reference may exist from the page cache.

    Reported-by: Matthew Wilcox
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Sasha Levin

    Miklos Szeredi
     
  • [ Upstream commit b849dd84b6ccfe32622988b79b7b073861fcf9f7 ]

    While trying to "dd" to the block device for a USB stick, I
    encountered a hung task warning (blocked for > 120 seconds). I
    managed to come up with an easy way to reproduce this on my system
    (where /dev/sdb is the block device for my USB stick) with:

    while true; do dd if=/dev/zero of=/dev/sdb bs=4M; done

    With my reproduction here are the relevant bits from the hung task
    detector:

    INFO: task udevd:294 blocked for more than 122 seconds.
    ...
    udevd D 0 294 1 0x00400008
    Call trace:
    ...
    mutex_lock_nested+0x40/0x50
    __blkdev_get+0x7c/0x3d4
    blkdev_get+0x118/0x138
    blkdev_open+0x94/0xa8
    do_dentry_open+0x268/0x3a0
    vfs_open+0x34/0x40
    path_openat+0x39c/0xdf4
    do_filp_open+0x90/0x10c
    do_sys_open+0x150/0x3c8
    ...

    ...
    Showing all locks held in the system:
    ...
    1 lock held by dd/2798:
    #0: ffffff814ac1a3b8 (&bdev->bd_mutex){+.+.}, at: __blkdev_put+0x50/0x204
    ...
    dd D 0 2798 2764 0x00400208
    Call trace:
    ...
    schedule+0x8c/0xbc
    io_schedule+0x1c/0x40
    wait_on_page_bit_common+0x238/0x338
    __lock_page+0x5c/0x68
    write_cache_pages+0x194/0x500
    generic_writepages+0x64/0xa4
    blkdev_writepages+0x24/0x30
    do_writepages+0x48/0xa8
    __filemap_fdatawrite_range+0xac/0xd8
    filemap_write_and_wait+0x30/0x84
    __blkdev_put+0x88/0x204
    blkdev_put+0xc4/0xe4
    blkdev_close+0x28/0x38
    __fput+0xe0/0x238
    ____fput+0x1c/0x28
    task_work_run+0xb0/0xe4
    do_notify_resume+0xfc0/0x14bc
    work_pending+0x8/0x14

    The problem appears related to the fact that my USB disk is terribly
    slow and that I have a lot of RAM in my system to cache things.
    Specifically my writes seem to be happening at ~15 MB/s and I've got
    ~4 GB of RAM in my system that can be used for buffering. To write 4
    GB of buffer to disk thus takes ~4000 MB / ~15 MB/s = ~267 seconds.

    The 267 second number is a problem because in __blkdev_put() we call
    sync_blockdev() while holding the bd_mutex. Any other callers who
    want the bd_mutex will be blocked for the whole time.

    The problem is made worse because I believe blkdev_put() specifically
    tells other tasks (namely udev) to go try to access the device at right
    around the same time we're going to hold the mutex for a long time.

    Putting some traces around this (after disabling the hung task detector),
    I could confirm:
    dd: 437.608600: __blkdev_put() right before sync_blockdev() for sdb
    udevd: 437.623901: blkdev_open() right before blkdev_get() for sdb
    dd: 661.468451: __blkdev_put() right after sync_blockdev() for sdb
    udevd: 663.820426: blkdev_open() right after blkdev_get() for sdb

    A simple fix for this is to realize that sync_blockdev() works fine if
    you're not holding the mutex. Also, it's not the end of the world if
    you sync a little early (though it can have performance impacts).
    Thus we can make a guess that we're going to need to do the sync and
    then do it without holding the mutex. We still do one last sync with
    the mutex but it should be much, much faster.
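
    The effect on mutex hold time can be estimated with the figures quoted
    above. This is a back-of-the-envelope sketch, not a measurement:
    seconds_under_mutex() is a hypothetical helper, and the 30 MB
    re-dirtied during the early sync is an assumed value chosen only to
    show the order of magnitude.

```python
def seconds_under_mutex(dirty_mb, rate_mb_s, presync, redirtied_mb=0):
    """Return how long the final sync holds the mutex, in seconds.

    presync: flush everything once before taking the mutex (the fix).
    redirtied_mb: data dirtied again during the pre-sync (assumed value).
    """
    if presync:
        dirty_mb = redirtied_mb  # bulk of the data was written without the lock
    return dirty_mb / rate_mb_s


before = seconds_under_mutex(4000, 15, presync=False)
after = seconds_under_mutex(4000, 15, presync=True, redirtied_mb=30)
print(round(before), round(after))  # 267 2
```

    The first figure matches the ~267 seconds computed above and exceeds
    the 120 second hung task threshold; the second stays far below it
    under the stated assumption.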

    With this, my hung task warnings for my test case are gone.

    Signed-off-by: Douglas Anderson
    Reviewed-by: Guenter Roeck
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Douglas Anderson
     
  • [ Upstream commit aec7db3b13a07d515c15ada752a7287a44a79ea0 ]

    I made a mistake with my previous fix, I assumed that we didn't need to
    mess with the reloc roots once we were out of the part of relocation where
    we are actually moving the extents.

    The subtle thing that I missed is that btrfs_init_reloc_root() also
    updates the last_trans for the reloc root when we do
    btrfs_record_root_in_trans() for the corresponding fs_root. I've added a
    comment to make sure future me doesn't make this mistake again.

    This showed up as a WARN_ON() in btrfs_copy_root() because our
    last_trans did not equal the current transid. This could happen if we
    snapshotted a fs root with a reloc root after we set
    rc->create_reloc_tree = 0, but before we actually merge the reloc root.

    Worth mentioning that the regression produced the following warning
    when running snapshot creation and balance in parallel:

    BTRFS info (device sdc): relocating block group 30408704 flags metadata|dup
    ------------[ cut here ]------------
    WARNING: CPU: 0 PID: 12823 at fs/btrfs/ctree.c:191 btrfs_copy_root+0x26f/0x430 [btrfs]
    CPU: 0 PID: 12823 Comm: btrfs Tainted: G W 5.6.0-rc7-btrfs-next-58 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_copy_root+0x26f/0x430 [btrfs]
    RSP: 0018:ffffb96e044279b8 EFLAGS: 00010202
    RAX: 0000000000000009 RBX: ffff9da70bf61000 RCX: ffffb96e04427a48
    RDX: ffff9da733a770c8 RSI: ffff9da70bf61000 RDI: ffff9da694163818
    RBP: ffff9da733a770c8 R08: fffffffffffffff8 R09: 0000000000000002
    R10: ffffb96e044279a0 R11: 0000000000000000 R12: ffff9da694163818
    R13: fffffffffffffff8 R14: ffff9da6d2512000 R15: ffff9da714cdac00
    FS: 00007fdeacf328c0(0000) GS:ffff9da735e00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 000055a2a5b8a118 CR3: 00000001eed78002 CR4: 00000000003606f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    ? create_reloc_root+0x49/0x2b0 [btrfs]
    ? kmem_cache_alloc_trace+0xe5/0x200
    create_reloc_root+0x8b/0x2b0 [btrfs]
    btrfs_reloc_post_snapshot+0x96/0x5b0 [btrfs]
    create_pending_snapshot+0x610/0x1010 [btrfs]
    create_pending_snapshots+0xa8/0xd0 [btrfs]
    btrfs_commit_transaction+0x4c7/0xc50 [btrfs]
    ? btrfs_mksubvol+0x3cd/0x560 [btrfs]
    btrfs_mksubvol+0x455/0x560 [btrfs]
    __btrfs_ioctl_snap_create+0x15f/0x190 [btrfs]
    btrfs_ioctl_snap_create_v2+0xa4/0xf0 [btrfs]
    ? mem_cgroup_commit_charge+0x6e/0x540
    btrfs_ioctl+0x12d8/0x3760 [btrfs]
    ? do_raw_spin_unlock+0x49/0xc0
    ? _raw_spin_unlock+0x29/0x40
    ? __handle_mm_fault+0x11b3/0x14b0
    ? ksys_ioctl+0x92/0xb0
    ksys_ioctl+0x92/0xb0
    ? trace_hardirqs_off_thunk+0x1a/0x1c
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x5c/0x280
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fdeabd3bdd7

    Fixes: 2abc726ab4b8 ("btrfs: do not init a reloc root if we aren't relocating")
    Reviewed-by: Filipe Manana
    Signed-off-by: Josef Bacik
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 08ca8b21f760c0ed5034a5c122092eec22ccf8f4 ]

    When a subrequest is being detached from the subgroup, we want to
    ensure that it is not holding the group lock, or in the process
    of waiting for the group lock.

    Fixes: 5b2b5187fa85 ("NFS: Fix nfs_page_group_destroy() and nfs_lock_and_join_requests() race cases")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit acc5af3efa303d5f36cc8c0f61716161f6ca1384 ]

    In "ubifs_check_node", when the value of "node_len" is abnormal,
    the code jumps to the "out_len" label. Then, in the following
    "ubifs_dump_node", if the node type is "UBIFS_DATA_NODE", an
    out-of-bounds access may occur in "print_hex_dump" due to the
    wrong "ch->len".

    Therefore, when the value of "node_len" is abnormal, the data length
    should be adjusted to a reasonably safe range. At this point the
    structured data is not trustworthy, so dump the corrupted data directly
    for analysis.
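
    The idea of clamping an untrusted length before dumping can be
    sketched as follows. dump_corrupted(), its parameters and the 8192
    byte cap are assumptions for illustration and do not reproduce the
    UBIFS code: the claimed length is bounded by the space left, by a
    fixed cap and by the buffer itself, and the raw bytes are dumped
    without interpreting any node structure.

```python
def dump_corrupted(buf, claimed_len, space_left, cap=8192):
    """Hex-dump at most a safe number of bytes from an untrusted node."""
    safe_len = max(0, min(claimed_len, space_left, cap, len(buf)))
    lines = []
    for off in range(0, safe_len, 32):
        chunk = buf[off:min(off + 32, safe_len)]
        lines.append(f"{off:08x}: " + " ".join(f"{b:02x}" for b in chunk))
    return lines


buf = bytes(range(64))
# A corrupted header claims a 1 GiB node, but only 40 bytes remain.
lines = dump_corrupted(buf, claimed_len=1 << 30, space_left=40)
print("\n".join(lines))
```

    Only 40 bytes are printed here, regardless of the length the
    corrupted header claims.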

    Signed-off-by: Liu Song
    Signed-off-by: Richard Weinberger
    Signed-off-by: Sasha Levin

    Liu Song
     
  • [ Upstream commit 927cc5cec35f01fe4f8af0ba80830a90b0533983 ]

    Memory leak occurs when files with extended attributes are added to
    orphan list.

    Signed-off-by: Zhihao Cheng
    Fixes: 988bec41318f3fa897e2f8 ("ubifs: orphan: Handle xattrs like files")
    Signed-off-by: Richard Weinberger
    Signed-off-by: Sasha Levin

    Zhihao Cheng
     
  • [ Upstream commit 81423c78551654953d746250f1721300b470be0e ]

    When inodes with extended attributes are evicted, xent is not freed in one
    exit branch.

    Signed-off-by: Zhihao Cheng
    Fixes: 9ca2d732644484488db3112 ("ubifs: Limit number of xattrs per inode")
    Signed-off-by: Richard Weinberger
    Signed-off-by: Sasha Levin

    Zhihao Cheng
     
  • [ Upstream commit 27fb5a72f50aa770dd38b0478c07acacef97e3e7 ]

    I noticed that fsfreeze can take a very long time to freeze an XFS if
    there happens to be a GETFSMAP caller running in the background. I also
    happened to notice the following in dmesg:

    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 43492 at fs/xfs/xfs_super.c:853 xfs_quiesce_attr+0x83/0x90 [xfs]
    Modules linked in: xfs libcrc32c ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 ip_set_hash_ip ip_set_hash_net xt_tcpudp xt_set ip_set_hash_mac ip_set nfnetlink ip6table_filter ip6_tables bfq iptable_filter sch_fq_codel ip_tables x_tables nfsv4 af_packet [last unloaded: xfs]
    CPU: 2 PID: 43492 Comm: xfs_io Not tainted 5.6.0-rc4-djw #rc4
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.10.2-1ubuntu1 04/01/2014
    RIP: 0010:xfs_quiesce_attr+0x83/0x90 [xfs]
    Code: 7c 07 00 00 85 c0 75 22 48 89 df 5b e9 96 c1 00 00 48 c7 c6 b0 2d 38 a0 48 89 df e8 57 64 ff ff 8b 83 7c 07 00 00 85 c0 74 de 0b 48 89 df 5b e9 72 c1 00 00 66 90 0f 1f 44 00 00 41 55 41 54
    RSP: 0018:ffffc900030f3e28 EFLAGS: 00010202
    RAX: 0000000000000001 RBX: ffff88802ac54000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff81e4a6f0 RDI: 00000000ffffffff
    RBP: ffff88807859f070 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000010 R12: 0000000000000000
    R13: ffff88807859f388 R14: ffff88807859f4b8 R15: ffff88807859f5e8
    FS: 00007fad1c6c0fc0(0000) GS:ffff88807e000000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f0c7d237000 CR3: 0000000077f01003 CR4: 00000000001606a0
    Call Trace:
    xfs_fs_freeze+0x25/0x40 [xfs]
    freeze_super+0xc8/0x180
    do_vfs_ioctl+0x70b/0x750
    ? __fget_files+0x135/0x210
    ksys_ioctl+0x3a/0xb0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x50/0x1a0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    These two things appear to be related. The assertion trips when another
    thread initiates a fsmap request (which uses an empty transaction) after
    the freezer waited for m_active_trans to hit zero but before the
    freezer executes the WARN_ON just prior to calling xfs_log_quiesce.

    The lengthy delays in freezing happen because the freezer calls
    xfs_wait_buftarg to clean out the buffer lru list. Meanwhile, the
    GETFSMAP caller is continuing to grab and release buffers, which means
    that it can take a very long time for the buffer lru list to empty out.

    We fix both of these races by calling sb_start_write to obtain freeze
    protection while using empty transactions for GETFSMAP and for metadata
    scrubbing. The other two users occur during mount, during which time we
    cannot fs freeze.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit 76518d3798855242817e8a8ed76b2d72f4415624 ]

    This changes do_io_accounting to use the new exec_update_mutex
    instead of cred_guard_mutex.

    This fixes possible deadlocks when the trace is accessing
    /proc/$pid/io for instance.

    This should be safe, as the credentials are only used for reading.

    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Bernd Edlinger
     
  • [ Upstream commit 2db9dbf71bf98d02a0bf33e798e5bfd2a9944696 ]

    This changes lock_trace to use the new exec_update_mutex
    instead of cred_guard_mutex.

    This fixes possible deadlocks when the trace is accessing
    /proc/$pid/stack for instance.

    This should be safe, as the credentials are only used for reading,
    and task->mm is updated on execve under the new exec_update_mutex.

    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Bernd Edlinger
     
  • [ Upstream commit eea9673250db4e854e9998ef9da6d4584857f0ea ]

    The cred_guard_mutex is problematic as it is held over possibly
    indefinite waits for userspace. The possible indefinite waits for
    userspace that I have identified are: The cred_guard_mutex is held in
    PTRACE_EVENT_EXIT waiting for the tracer. The cred_guard_mutex is
    held over "put_user(0, tsk->clear_child_tid)" in exit_mm(). The
    cred_guard_mutex is held over "get_user(futex_offset, ...)" in
    exit_robust_list. The cred_guard_mutex is held over copy_strings.

    The functions get_user and put_user can trigger a page fault which can
    potentially wait indefinitely in the case of userfaultfd or if
    userspace implements part of the page fault path.

    In any of those cases the userspace process that the kernel is waiting
    for might make a different system call that winds up taking the
    cred_guard_mutex and result in deadlock.

    Holding a mutex over any of those possibly indefinite waits for
    userspace does not appear necessary. Add exec_update_mutex that will
    just cover updating the process during exec where the permissions and
    the objects pointed to by the task struct may be out of sync.

    The plan is to switch the users of cred_guard_mutex to
    exec_update_mutex one by one. This lets us move forward while still
    being careful and not introducing any regressions.

    Link: https://lore.kernel.org/lkml/20160921152946.GA24210@dhcp22.suse.cz/
    Link: https://lore.kernel.org/lkml/AM6PR03MB5170B06F3A2B75EFB98D071AE4E60@AM6PR03MB5170.eurprd03.prod.outlook.com/
    Link: https://lore.kernel.org/linux-fsdevel/20161102181806.GB1112@redhat.com/
    Link: https://lore.kernel.org/lkml/20160923095031.GA14923@redhat.com/
    Link: https://lore.kernel.org/lkml/20170213141452.GA30203@redhat.com/
    Ref: 45c1a159b85b ("Add PTRACE_O_TRACEVFORKDONE and PTRACE_O_TRACEEXIT facilities.")
    Ref: 456f17cd1a28 ("[PATCH] user-vm-unlock-2.5.31-A2")
    Reviewed-by: Kirill Tkhai
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Bernd Edlinger
    Signed-off-by: Eric W. Biederman
    Signed-off-by: Sasha Levin

    Eric W. Biederman
     
  • [ Upstream commit 1a0afa0ecfc4dbc8d7583d03cafd3f68f781df0c ]

    If we have an error while processing the reloc roots we could leak roots
    that were added to rc->reloc_roots before we hit the error. We could
    have also not removed the reloc tree mapping from our rb_tree, so clean
    up any remaining nodes in the reloc root rb_tree.

    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    [ use rbtree_postorder_for_each_entry_safe ]
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 2abc726ab4b83db774e315c660ab8da21477092f ]

    We previously were checking if the root had a dead root before accessing
    root->reloc_root in order to avoid a use-after-free type bug. However
    this scenario happens after we've unset the reloc control, so we would
    have been saved if we'd simply checked for fs_info->reloc_control. At
    this point during relocation we no longer need to be creating new reloc
    roots, so simply move this check above the reloc_root checks to avoid
    any future races and confusion.

    Reviewed-by: Qu Wenruo
    Signed-off-by: Josef Bacik
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit a451b12311aa8c96c6f6e01c783a86995dc3ec6b ]

    In NFSv4, the lock stateids are tied to the lockowner, and the open stateid,
    so that the action of closing the file also results in either an automatic
    loss of the locks, or an error of the form NFS4ERR_LOCKS_HELD.

    In practice this means we must not add new locks to the open stateid
    after the close process has been invoked. In fact, doing so can result
    in the following panic:

    kernel BUG at lib/list_debug.c:51!
    invalid opcode: 0000 [#1] SMP NOPTI
    CPU: 2 PID: 1085 Comm: nfsd Not tainted 5.6.0-rc3+ #2
    Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.14410784.B64.1908150010 08/15/2019
    RIP: 0010:__list_del_entry_valid.cold+0x31/0x55
    Code: 1a 3d 9b e8 74 10 c2 ff 0f 0b 48 c7 c7 f0 1a 3d 9b e8 66 10 c2 ff 0f 0b 48 89 f2 48 89 fe 48 c7 c7 b0 1a 3d 9b e8 52 10 c2 ff 0b 48 89 fe 4c 89 c2 48 c7 c7 78 1a 3d 9b e8 3e 10 c2 ff 0f 0b
    RSP: 0018:ffffb296c1d47d90 EFLAGS: 00010246
    RAX: 0000000000000054 RBX: ffff8ba032456ec8 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffff8ba039e99cc8 RDI: ffff8ba039e99cc8
    RBP: ffff8ba032456e60 R08: 0000000000000781 R09: 0000000000000003
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff8ba009a4abe0
    R13: ffff8ba032456e8c R14: 0000000000000000 R15: ffff8ba00adb01d8
    FS: 0000000000000000(0000) GS:ffff8ba039e80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fb213f0b008 CR3: 00000001347de006 CR4: 00000000003606e0
    Call Trace:
    release_lock_stateid+0x2b/0x80 [nfsd]
    nfsd4_free_stateid+0x1e9/0x210 [nfsd]
    nfsd4_proc_compound+0x414/0x700 [nfsd]
    ? nfs4svc_decode_compoundargs+0x407/0x4c0 [nfsd]
    nfsd_dispatch+0xc1/0x200 [nfsd]
    svc_process_common+0x476/0x6f0 [sunrpc]
    ? svc_sock_secure_port+0x12/0x30 [sunrpc]
    ? svc_recv+0x313/0x9c0 [sunrpc]
    ? nfsd_svc+0x2d0/0x2d0 [nfsd]
    svc_process+0xd4/0x110 [sunrpc]
    nfsd+0xe3/0x140 [nfsd]
    kthread+0xf9/0x130
    ? nfsd_destroy+0x50/0x50 [nfsd]
    ? kthread_park+0x90/0x90
    ret_from_fork+0x1f/0x40

    The fix is to ensure that lock creation tests for whether or not the
    open stateid is unhashed, and to fail if that is the case.

    Fixes: 659aefb68eca ("nfsd: Ensure we don't recognise lock stateids after freeing them")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Chuck Lever
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit eb5760863fc28feab28b567ddcda7e667e638da0 ]

    We already have similar code in ext4_mb_complex_scan_group(), but
    ext4_mb_simple_scan_group() is still affected.

    Other reports: https://www.spinics.net/lists/linux-ext4/msg60231.html

    Reviewed-by: Andreas Dilger
    Signed-off-by: Dmitry Monakhov
    Link: https://lore.kernel.org/r/20200310150156.641-1-dmonakhov@gmail.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Dmitry Monakhov
     
  • [ Upstream commit 2e107cf869eecc770e3f630060bb4e5f547d0fd8 ]

    In xchk_dir_actor, we attempt to validate the directory hash structures
    by performing a directory entry lookup by (hashed) name. If the lookup
    returns ENOENT, that means that the hash information is corrupt. The
    _process_error functions don't catch this, so we have to add that
    explicitly.

    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit 1cb5deb5bc095c070c09a4540c45f9c9ba24be43 ]

    If we decide that a directory free block is corrupt, we must take care
    not to leak a buffer pointer to the caller. After xfs_trans_brelse
    returns, the buffer can be freed or reused, which means that we have to
    set *bpp back to NULL.

    Callers are supposed to notice the nonzero return value and not use the
    buffer pointer, but we should code more defensively, even if all current
    callers handle this situation correctly.
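
    The defensive pattern can be sketched in userspace C. This is a
    minimal illustration under stated assumptions, not the XFS code:
    struct buf, GOOD_MAGIC and read_checked are hypothetical, free()
    stands in for releasing the buffer, and -EINVAL is a placeholder
    error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a buffer whose header can be verified. */
struct buf {
    unsigned int magic;
};

#define GOOD_MAGIC 0x58444633u /* arbitrary value for this sketch */

/*
 * Read a buffer and hand it back through *bpp.  On any failure the
 * buffer is released and *bpp is reset to NULL, so a caller that
 * ignores the return value cannot touch freed memory.
 */
int read_checked(unsigned int magic_on_disk, struct buf **bpp)
{
    assert(bpp != NULL);

    *bpp = malloc(sizeof(**bpp));
    if (*bpp == NULL)
        return -ENOMEM;
    (*bpp)->magic = magic_on_disk;

    if ((*bpp)->magic != GOOD_MAGIC) {
        free(*bpp);     /* after this the memory may be reused */
        *bpp = NULL;    /* do not leak the stale pointer */
        return -EINVAL; /* placeholder error code */
    }
    return 0;
}
```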

    Fixes: de14c5f541e7 ("xfs: verify free block header fields")
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Dave Chinner
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit dce8e237100f60c28cc66effb526ba65a01d8cb3 ]

    KCSAN found that inode->i_disksize could be accessed concurrently.

    BUG: KCSAN: data-race in ext4_mark_iloc_dirty / ext4_write_end

    write (marked) to 0xffff8b8932f40090 of 8 bytes by task 66792 on cpu 0:
    ext4_write_end+0x53f/0x5b0
    ext4_da_write_end+0x237/0x510
    generic_perform_write+0x1c4/0x2a0
    ext4_buffered_write_iter+0x13a/0x210
    ext4_file_write_iter+0xe2/0x9b0
    new_sync_write+0x29c/0x3a0
    __vfs_write+0x92/0xa0
    vfs_write+0xfc/0x2a0
    ksys_write+0xe8/0x140
    __x64_sys_write+0x4c/0x60
    do_syscall_64+0x8a/0x2a0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    read to 0xffff8b8932f40090 of 8 bytes by task 14414 on cpu 1:
    ext4_mark_iloc_dirty+0x716/0x1190
    ext4_mark_inode_dirty+0xc9/0x360
    ext4_convert_unwritten_extents+0x1bc/0x2a0
    ext4_convert_unwritten_io_end_vec+0xc5/0x150
    ext4_put_io_end+0x82/0x130
    ext4_writepages+0xae7/0x16f0
    do_writepages+0x64/0x120
    __writeback_single_inode+0x7d/0x650
    writeback_sb_inodes+0x3a4/0x860
    __writeback_inodes_wb+0xc4/0x150
    wb_writeback+0x43f/0x510
    wb_workfn+0x3b2/0x8a0
    process_one_work+0x39b/0x7e0
    worker_thread+0x88/0x650
    kthread+0x1d4/0x1f0
    ret_from_fork+0x35/0x40

    The plain read is outside of inode->i_data_sem critical section
    which results in a data race. Fix it by adding READ_ONCE().

    Signed-off-by: Qiujun Huang
    Link: https://lore.kernel.org/r/1582556566-3909-1-git-send-email-hqjagain@gmail.com
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Sasha Levin

    Qiujun Huang
     
  • [ Upstream commit a9ceb060b3cf37987b6162223575eaf4f4e0fc36 ]

    perf does not know how to deal with a __builtin_bswap32() call, and
    complains. All other functions just store the xid etc in host endian
    form, so let's do that in the tracepoint for nfsd_file_acquire too.
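
    The idea of converting once at record time, so that the print step
    needs no byte swap, can be sketched in userspace C. This is an
    illustration only: struct trace_entry, record_xid and format_xid are
    hypothetical names, and POSIX ntohl() stands in for the kernel's
    big-endian conversion helper.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical trace record: the xid is kept in host byte order. */
struct trace_entry {
    uint32_t xid;
};

/* Convert the wire (big-endian) xid once, when the event is recorded. */
void record_xid(struct trace_entry *e, uint32_t xid_be)
{
    assert(e != NULL);
    e->xid = ntohl(xid_be);
}

/* Formatting is now a plain print of the stored field, with no swap. */
int format_xid(const struct trace_entry *e, char *out, size_t n)
{
    return snprintf(out, n, "xid=0x%08" PRIx32, e->xid);
}
```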

    Signed-off-by: Trond Myklebust
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit 9a6bed4fe0c8bf57785cbc4db9f86086cb9b193d ]

    If the caller passes in a NULL cap_reservation, and we can't allocate
    one then ensure that we fail gracefully.

    Signed-off-by: Jeff Layton
    Signed-off-by: Ilya Dryomov
    Signed-off-by: Sasha Levin

    Jeff Layton
     
  • [ Upstream commit 90d2f1da832fd23290ef0c0d964d97501e5e8553 ]

    If nfsd_file_mark_find_or_create() keeps winning the race for the
    nfsd_file_fsnotify_group->mark_mutex against nfsd_file_mark_put()
    then it can soft lock up, since fsnotify_add_inode_mark() ends
    up always finding an existing entry.

    Signed-off-by: Trond Myklebust
    Signed-off-by: J. Bruce Fields
    Signed-off-by: Sasha Levin

    Trond Myklebust
     
  • [ Upstream commit f6d2a5c263afca84646cf3300dc13061bedbd99e ]

    Inspired by btrfs-progs GitHub issue #208, where a chunk item in the
    chunk tree has an invalid num_stripes (0).

    Although that can already be caught by the current
    btrfs_check_chunk_valid(), that function doesn't really check the item
    size, as it needs to handle chunk items in the super block
    sys_chunk_array().

    This patch adds two extra checks for chunk items in the chunk tree:

    - Basic chunk item size
      If the item is smaller than btrfs_chunk (which already contains one
      stripe), exit right away, as reading num_stripes may even go beyond
      the eb boundary.

    - Item size check against num_stripes
      If the item size doesn't match the calculated chunk size, then either
      the item size or num_stripes is corrupted. Error out anyway.
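
    The two checks can be sketched in userspace C as follows. The
    structure layouts are simplified and hypothetical (they are not the
    on-disk btrfs format), and chunk_item_size and check_chunk_item_size
    are names invented for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical layouts: not the on-disk btrfs format. */
struct stripe {
    uint64_t devid;
    uint64_t offset;
};

struct chunk {
    uint64_t length;
    uint16_t num_stripes;
    struct stripe stripe; /* the first stripe is part of the header */
};

/* Size of a chunk item carrying num_stripes stripes (num_stripes >= 1). */
size_t chunk_item_size(unsigned int num_stripes)
{
    return sizeof(struct chunk) + (num_stripes - 1) * sizeof(struct stripe);
}

/* Return 0 if item_size is consistent with the chunk, -1 otherwise. */
int check_chunk_item_size(const void *item, size_t item_size)
{
    uint16_t num_stripes;

    assert(item != NULL);

    /* Check 1: too small to hold even the header, so do not read it. */
    if (item_size < sizeof(struct chunk))
        return -1;

    memcpy(&num_stripes,
           (const unsigned char *)item + offsetof(struct chunk, num_stripes),
           sizeof(num_stripes));
    if (num_stripes == 0)
        return -1;

    /* Check 2: the item size must match what num_stripes implies. */
    if (item_size != chunk_item_size(num_stripes))
        return -1;
    return 0;
}
```

    Ordering matters here: the minimum-size check runs before num_stripes
    is read, so a truncated item is rejected without reading past its end.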

    Reviewed-by: Josef Bacik
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Qu Wenruo
     
  • [ Upstream commit b1de6fc7520fe12949c070af0e8c0e4044cd3420 ]

    Omar Sandoval reported that a 4G fallocate on the realtime device causes
    filesystem shutdowns due to a log reservation overflow that happens when
    we log the rtbitmap updates. Factor rtbitmap/rtsummary updates into
    the tr_write and tr_itruncate log reservation calculation.

    "The following reproducer results in a transaction log overrun warning
    for me:

    mkfs.xfs -f -r rtdev=/dev/vdc -d rtinherit=1 -m reflink=0 /dev/vdb
    mount -o rtdev=/dev/vdc /dev/vdb /mnt
    fallocate -l 4G /mnt/foo

    Reported-by: Omar Sandoval
    Tested-by: Omar Sandoval
    Signed-off-by: Darrick J. Wong
    Reviewed-by: Brian Foster
    Signed-off-by: Sasha Levin

    Darrick J. Wong
     
  • [ Upstream commit 0c4da70c83d41a8461fdf50a3f7b292ecb04e378 ]

    Realtime files in XFS allocate extents in rextsize units. However, the
    written/unwritten state of those extents is still tracked in blocksize
    units. Therefore, a realtime file can be split up into written and
    unwritten extents that are not necessarily aligned to the realtime
    extent size. __xfs_bunmapi() has some logic to handle these various
    corner cases. Consider how it handles the following case:

    1. The last extent is unwritten.
    2. The last extent is smaller than the realtime extent size.
    3. startblock of the last extent is not aligned to the realtime extent
    size, but startblock + blockcount is.

    In this case, __xfs_bunmapi() calls xfs_bmap_add_extent_unwritten_real()
    to set the second-to-last extent to unwritten. This should merge the
    last and second-to-last extents, so __xfs_bunmapi() moves on to the
    second-to-last extent.

    However, if the size of the last and second-to-last extents combined is
    greater than MAXEXTLEN, xfs_bmap_add_extent_unwritten_real() does not
    merge the two extents. When that happens, __xfs_bunmapi() skips past the
    last extent without unmapping it, thus leaking the space.

    Fix it by only unwriting the minimum amount needed to align the last
    extent to the realtime extent size, which is guaranteed to merge with
    the last extent.
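
    The "minimum amount needed to align" can be illustrated with plain
    modular arithmetic. This is a sketch under assumptions, not the XFS
    code: block numbers are treated as unsigned integers, rextsize is the
    realtime extent size in blocks, and blocks_to_align and aligned_start
    are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of blocks between the previous realtime-extent boundary and
 * startblock.  In this sketch, that is how many blocks in front of the
 * last extent would have to be unwritten to reach alignment.
 */
uint64_t blocks_to_align(uint64_t startblock, uint32_t rextsize)
{
    assert(rextsize > 0);
    return startblock % rextsize;
}

/* First block of the realtime extent that contains startblock. */
uint64_t aligned_start(uint64_t startblock, uint32_t rextsize)
{
    return startblock - blocks_to_align(startblock, rextsize);
}
```

    For example, with rextsize 4 and a last extent covering blocks 10 and
    11, two blocks (8 and 9) would need unwriting, which gives an aligned
    range from block 8 up to block 12.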

    Signed-off-by: Omar Sandoval
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong
    Signed-off-by: Sasha Levin

    Omar Sandoval