27 Jan, 2021

3 commits

  • Wire up the splice_read and splice_write methods to the default
    helpers using ->read_iter and ->write_iter now that those are
    implemented for kernfs. This restores support for splice and
    sendfile on kernfs files.

    Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops")
    Reported-by: Siddharth Gupta
    Tested-by: Siddharth Gupta
    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210120204631.274206-4-hch@lst.de
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit f2d6c2708bd84ca953fa6b6ca5717e79eb0140c7)
    Bug: 176079972
    Signed-off-by: Greg Kroah-Hartman
    Change-Id: I80f09db0b8569fa63db59ada9c3d45d6600b2d70

    Christoph Hellwig
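
From userspace, the visible effect of this change is that sendfile(2) and splice(2) work again on kernfs-backed files such as those under /sys. The following is a minimal sketch, assuming a Linux system; the helper name copy_with_sendfile is hypothetical and is not part of the commit.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: copy the whole file at src_path to out_fd with
 * sendfile(2). Returns the number of bytes copied, or -1 with errno set.
 * On a kernel that has 36e2c7421f02 but not this fix, the sendfile() call
 * is expected to fail with EINVAL when src_path is a kernfs file.
 */
static ssize_t copy_with_sendfile(const char *src_path, int out_fd)
{
	assert(src_path != NULL);

	int in_fd = open(src_path, O_RDONLY);
	if (in_fd < 0)
		return -1;

	ssize_t total = 0;
	for (;;) {
		/* NULL offset: use and advance in_fd's own file position. */
		ssize_t n = sendfile(out_fd, in_fd, NULL, 4096);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			int saved = errno;
			close(in_fd);
			errno = saved;
			return -1;
		}
		if (n == 0)	/* end of file */
			break;
		total += n;
	}

	close(in_fd);
	return total;
}
```

Calling it with a sysfs attribute as src_path and an open file or pipe as out_fd is one way to check whether a given kernel carries the fix.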
     
  • Switch kernfs to implement the write_iter method instead of plain old
    write in preparation for supporting splice and sendfile again.

    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210120204631.274206-3-hch@lst.de
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit cc099e0b399889c6485c88368b19824b087c9f8c)
    Signed-off-by: Greg Kroah-Hartman
    Change-Id: I660c93c520169534c874cef2d246fef8c8fe2bbc

    Christoph Hellwig
     
  • Switch kernfs to implement the read_iter method instead of plain old
    read in preparation for supporting splice and sendfile again.

    Signed-off-by: Christoph Hellwig
    Link: https://lore.kernel.org/r/20210120204631.274206-2-hch@lst.de
    Signed-off-by: Greg Kroah-Hartman
    (cherry picked from commit 4eaad21a6ac9865df7f31983232ed5928450458d)
    Signed-off-by: Greg Kroah-Hartman
    Change-Id: I75af3b3bae22de7ba46ec38b8182be335d6760e8

    Christoph Hellwig
     

20 Jan, 2021

31 commits

  • Changes in 5.10.9
    btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
    btrfs: prevent NULL pointer dereference in extent_io_tree_panic
    ALSA: hda/realtek: fix right sounds and mute/micmute LEDs for HP machines
    ALSA: doc: Fix reference to mixart.rst
    ASoC: AMD Renoir - add DMI entry for Lenovo ThinkPad X395
    ASoC: dapm: remove widget from dirty list on free
    x86/hyperv: check cpu mask after interrupt has been disabled
    drm/amdgpu: add green_sardine device id (v2)
    drm/amdgpu: fix DRM_INFO flood if display core is not supported (bug 210921)
    Revert "drm/amd/display: Fixed Intermittent blue screen on OLED panel"
    drm/amdgpu: add new device id for Renior
    drm/i915: Allow the sysadmin to override security mitigations
    drm/i915/gt: Limit VFE threads based on GT
    drm/i915/backlight: fix CPU mode backlight takeover on LPT
    drm/bridge: sii902x: Refactor init code into separate function
    dt-bindings: display: sii902x: Add supply bindings
    drm/bridge: sii902x: Enable I/O and core VCC supplies if present
    tracing/kprobes: Do the notrace functions check without kprobes on ftrace
    tools/bootconfig: Add tracing_on support to helper scripts
    ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR
    ext4: fix wrong list_splice in ext4_fc_cleanup
    ext4: fix bug for rename with RENAME_WHITEOUT
    cifs: check pointer before freeing
    cifs: fix interrupted close commands
    riscv: Drop a duplicated PAGE_KERNEL_EXEC
    riscv: return -ENOSYS for syscall -1
    riscv: Fixup CONFIG_GENERIC_TIME_VSYSCALL
    riscv: Fix KASAN memory mapping.
    mips: fix Section mismatch in reference
    mips: lib: uncached: fix non-standard usage of variable 'sp'
    MIPS: boot: Fix unaligned access with CONFIG_MIPS_RAW_APPENDED_DTB
    MIPS: Fix malformed NT_FILE and NT_SIGINFO in 32bit coredumps
    MIPS: relocatable: fix possible boot hangup with KASLR enabled
    RDMA/ocrdma: Fix use after free in ocrdma_dealloc_ucontext_pd()
    ACPI: scan: Harden acpi_device_add() against device ID overflows
    xen/privcmd: allow fetching resource sizes
    compiler.h: Raise minimum version of GCC to 5.1 for arm64
    mm/vmalloc.c: fix potential memory leak
    mm/hugetlb: fix potential missing huge page size info
    mm/process_vm_access.c: include compat.h
    dm raid: fix discard limits for raid1
    dm snapshot: flush merged data before committing metadata
    dm integrity: fix flush with external metadata device
    dm integrity: fix the maximum number of arguments
    dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
    dm crypt: do not wait for backlogged crypto request completion in softirq
    dm crypt: do not call bio_endio() from the dm-crypt tasklet
    dm crypt: defer decryption to a tasklet if interrupts disabled
    stmmac: intel: change all EHL/TGL to auto detect phy addr
    r8152: Add Lenovo Powered USB-C Travel Hub
    btrfs: tree-checker: check if chunk item end overflows
    ext4: don't leak old mountpoint samples
    io_uring: don't take files/mm for a dead task
    io_uring: drop mm and files after task_work_run
    ARC: build: remove non-existing bootpImage from KBUILD_IMAGE
    ARC: build: add uImage.lzma to the top-level target
    ARC: build: add boot_targets to PHONY
    ARC: build: move symlink creation to arch/arc/Makefile to avoid race
    ARM: omap2: pmic-cpcap: fix maximum voltage to be consistent with defaults on xt875
    ath11k: fix crash caused by NULL rx_channel
    netfilter: ipset: fixes possible oops in mtype_resize
    ath11k: qmi: try to allocate a big block of DMA memory first
    btrfs: fix async discard stall
    btrfs: merge critical sections of discard lock in workfn
    btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
    regulator: bd718x7: Add enable times
    ethernet: ucc_geth: fix definition and size of ucc_geth_tx_global_pram
    ARM: dts: ux500/golden: Set display max brightness
    habanalabs: adjust pci controller init to new firmware
    habanalabs/gaudi: retry loading TPC f/w on -EINTR
    habanalabs: register to pci shutdown callback
    staging: spmi: hisi-spmi-controller: Fix some error handling paths
    spi: altera: fix return value for altera_spi_txrx()
    habanalabs: Fix memleak in hl_device_reset
    hwmon: (pwm-fan) Ensure that calculation doesn't discard big period values
    lib/raid6: Let $(UNROLL) rules work with macOS userland
    kconfig: remove 'kvmconfig' and 'xenconfig' shorthands
    spi: fix the divide by 0 error when calculating xfer waiting time
    io_uring: drop file refs after task cancel
    bfq: Fix computation of shallow depth
    arch/arc: add copy_user_page() to <asm/page.h> to fix build error on ARC
    misdn: dsp: select CONFIG_BITREVERSE
    net: ethernet: fs_enet: Add missing MODULE_LICENSE
    selftests: fix the return value for UDP GRO test
    nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN
    nvme: avoid possible double fetch in handling CQE
    nvmet-rdma: Fix list_del corruption on queue establishment failure
    drm/amd/display: fix sysfs amdgpu_current_backlight_pwm NULL pointer issue
    drm/amdgpu: fix a GPU hang issue when remove device
    drm/amd/pm: fix the failure when change power profile for renoir
    drm/amdgpu: fix potential memory leak during navi12 deinitialization
    usb: typec: Fix copy paste error for NVIDIA alt-mode description
    iommu/vt-d: Fix lockdep splat in sva bind()/unbind()
    ACPI: scan: add stub acpi_create_platform_device() for !CONFIG_ACPI
    drm/msm: Call msm_init_vram before binding the gpu
    ARM: picoxcell: fix missing interrupt-parent properties
    poll: fix performance regression due to out-of-line __put_user()
    rcu-tasks: Move RCU-tasks initialization to before early_initcall()
    bpf: Simplify task_file_seq_get_next()
    bpf: Save correct stopping point in file seq iteration
    x86/sev-es: Fix SEV-ES OUT/IN immediate opcode vc handling
    cfg80211: select CONFIG_CRC32
    nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context
    iommu/vt-d: Update domain geometry in iommu_ops.at(de)tach_dev
    net/mlx5e: CT: Use per flow counter when CT flow accounting is enabled
    net/mlx5: Fix passing zero to 'PTR_ERR'
    net/mlx5: E-Switch, fix changing vf VLANID
    blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED
    mm: fix clear_refs_write locking
    mm: don't play games with pinned pages in clear_page_refs
    mm: don't put pinned pages into the swap cache
    perf intel-pt: Fix 'CPU too large' error
    dump_common_audit_data(): fix racy accesses to ->d_name
    ASoC: meson: axg-tdm-interface: fix loopback
    ASoC: meson: axg-tdmin: fix axg skew offset
    ASoC: Intel: fix error code cnl_set_dsp_D0()
    nvmet-rdma: Fix NULL deref when setting pi_enable and traddr INADDR_ANY
    nvme: don't intialize hwmon for discovery controllers
    nvme-tcp: fix possible data corruption with bio merges
    nvme-tcp: Fix warning with CONFIG_DEBUG_PREEMPT
    NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
    pNFS: We want return-on-close to complete when evicting the inode
    pNFS: Mark layout for return if return-on-close was not sent
    pNFS: Stricter ordering of layoutget and layoutreturn
    NFS: Adjust fs_context error logging
    NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
    NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
    NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
    NFS: nfs_delegation_find_inode_server must first reference the superblock
    NFS: nfs_igrab_and_active must first reference the superblock
    scsi: ufs: Fix possible power drain during system suspend
    ext4: fix superblock checksum failure when setting password salt
    RDMA/restrack: Don't treat as an error allocation ID wrapping
    RDMA/usnic: Fix memleak in find_free_vf_and_create_qp_grp
    bnxt_en: Improve stats context resource accounting with RDMA driver loaded.
    RDMA/mlx5: Fix wrong free of blue flame register on error
    IB/mlx5: Fix error unwinding when set_has_smi_cap fails
    umount(2): move the flag validity checks first
    dm zoned: select CONFIG_CRC32
    drm/i915/dsi: Use unconditional msleep for the panel_on_delay when there is no reset-deassert MIPI-sequence
    drm/i915/icl: Fix initing the DSI DSC power refcount during HW readout
    drm/i915/gt: Restore clear-residual mitigations for Ivybridge, Baytrail
    mm, slub: consider rest of partial list if acquire_slab() fails
    riscv: Trace irq on only interrupt is enabled
    iommu/vt-d: Fix unaligned addresses for intel_flush_svm_range_dev()
    net: sunrpc: interpret the return value of kstrtou32 correctly
    selftests: netfilter: Pass family parameter "-f" to conntrack tool
    dm: eliminate potential source of excessive kernel log noise
    ALSA: fireface: Fix integer overflow in transmit_midi_msg()
    ALSA: firewire-tascam: Fix integer overflow in midi_port_work()
    netfilter: conntrack: fix reading nf_conntrack_buckets
    netfilter: nf_nat: Fix memleak in nf_nat_init
    Linux 5.10.9

    Signed-off-by: Greg Kroah-Hartman
    Change-Id: I609e501511889081e03d2d18ee7e1be95406f396

    Greg Kroah-Hartman
     
  • commit a0a6df9afcaf439a6b4c88a3b522e3d05fdef46f upstream.

    Unfortunately, there's userland code that used to rely upon these
    checks being done before anything else to check for UMOUNT_NOFOLLOW
    support. That broke in 41525f56e256 ("fs: refactor ksys_umount").
    Separate those from the rest of checks and move them to ksys_umount();
    unlike everything else in there, this can be sanely done there.

    Reported-by: Sargun Dhillon
    Fixes: 41525f56e256 ("fs: refactor ksys_umount")
    Signed-off-by: Al Viro
    Signed-off-by: Greg Kroah-Hartman

    Al Viro
     
  • commit dfd56c2c0c0dbb11be939b804ddc8d5395ab3432 upstream.

    When setting password salt in the superblock, we forget to recompute the
    superblock checksum so it will not match until the next superblock
    modification which recomputes the checksum. Fix it.

    CC: Michael Halcrow
    Reported-by: Andreas Dilger
    Fixes: 9bd8212f981e ("ext4 crypto: add encryption policy and password salt support")
    Signed-off-by: Jan Kara
    Link: https://lore.kernel.org/r/20201216101844.22917-8-jack@suse.cz
    Signed-off-by: Theodore Ts'o
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
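
The bug pattern is a checksummed structure modified without refreshing its checksum. The following toy model illustrates it under stated assumptions: struct toy_sb, toy_csum and toy_set_salt are hypothetical, and the checksum is a simple polynomial hash rather than the crc32c that ext4 uses.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy superblock: the checksum covers every byte that precedes it. */
struct toy_sb {
	uint8_t  salt[16];
	uint32_t other_state;
	uint32_t checksum;
};

/* Toy checksum (NOT crc32c): polynomial hash over the covered bytes. */
static uint32_t toy_csum(const struct toy_sb *sb)
{
	const uint8_t *p = (const uint8_t *)sb;
	uint32_t sum = 0;

	for (size_t i = 0; i < offsetof(struct toy_sb, checksum); i++)
		sum = sum * 31 + p[i];
	return sum;
}

/* Store a new salt and refresh the checksum in the same operation. */
static void toy_set_salt(struct toy_sb *sb, const uint8_t salt[16])
{
	assert(sb != NULL && salt != NULL);
	memcpy(sb->salt, salt, 16);
	sb->checksum = toy_csum(sb);	/* the step the bug left out */
}

/* Returns nonzero when the stored checksum matches the contents. */
static int toy_sb_valid(const struct toy_sb *sb)
{
	return sb->checksum == toy_csum(sb);
}
```

Copying a salt into the structure without the final assignment leaves toy_sb_valid() returning 0 until some later update recomputes the checksum, which mirrors the failure described above.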
     
  • commit 896567ee7f17a8a736cda8a28cc987228410a2ac upstream.

    Before referencing the inode, we must ensure that the superblock can be
    referenced. Otherwise, we can end up with iput() calling superblock
    operations that are no longer valid or accessible.

    Fixes: ea7c38fef0b7 ("NFSv4: Ensure we reference the inode for return-on-close in delegreturn")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 113aac6d567bda783af36d08f73bfda47d8e9a40 upstream.

    Before referencing the inode, we must ensure that the superblock can be
    referenced. Otherwise, we can end up with iput() calling superblock
    operations that are no longer valid or accessible.

    Fixes: e39d8a186ed0 ("NFSv4: Fix an Oops during delegation callbacks")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit cb2856c5971723910a86b7d1d0cf623d6919cbc4 upstream.

    If we exit _lgopen_prepare_attached() without setting a layout, we will
    currently leak the plh_outstanding counter.

    Fixes: 411ae722d10a ("pNFS: Wait for stale layoutget calls to complete in pnfs_update_layout()")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 46c9ea1d4fee4cf1f8cc6001b9c14aae61b3d502 upstream.

    We must ensure that we pass a layout segment to nfs_retry_commit() when
    we're cleaning up after pnfs_bucket_alloc_ds_commits(). Otherwise,
    requests that should be committed to the DS will get committed to the
    MDS.
    Do so by ensuring that pnfs_bucket_get_committing() always tries to
    return a layout segment when it returns a non-empty page list.

    Fixes: c84bea59449a ("NFS/pNFS: Simplify bucket layout segment reference counting")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 1757655d780d9d29bc4b60e708342e94924f7ef3 upstream.

    In pnfs_generic_clear_request_commit(), we try calling
    pnfs_free_bucket_lseg() before we remove the request from the DS bucket.
    That will always fail, since the point is to test for whether or not
    that bucket is empty.

    Fixes: c84bea59449a ("NFS/pNFS: Simplify bucket layout segment reference counting")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit c98e9daa59a611ff4e163689815f40380c912415 upstream.

    Several existing dprintk()/dfprintk() calls were converted to use the new
    mount API logging macros by commit ce8866f0913f ("NFS: Attach
    supplementary error information to fs_context"). If the fs_context was
    not created using fsopen() then it will not have had a log buffer
    allocated for it, and the new mount API logging macros will wind up
    calling printk().

    This can result in syslog messages being logged where previously there
    were none... most notably "NFS4: Couldn't follow remote path", which can
    happen if the client is auto-negotiating a protocol version with an NFS
    server that doesn't support the higher v4.x versions.

    Convert the nfs_errorf(), nfs_invalf(), and nfs_warnf() macros to check
    for the existence of the fs_context's log buffer and call dprintk() if
    it doesn't exist. Add nfs_ferrorf(), nfs_finvalf(), and nfs_fwarnf(),
    which do the same thing but take an NFS debug flag as an argument and
    call dfprintk(). Finally, modify the "NFS4: Couldn't follow remote
    path" message to use nfs_ferrorf().

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207385
    Signed-off-by: Scott Mayhew
    Reviewed-by: Benjamin Coddington
    Fixes: ce8866f0913f ("NFS: Attach supplementary error information to fs_context.")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Scott Mayhew
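
The macro change amounts to "use the log buffer if there is one, otherwise fall back to debug-only output". The following is a userspace model of that dispatch, not the NFS macros themselves. Every toy_ name is hypothetical, and toy_debug_enabled stands in for the NFS debug flags.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

struct toy_log { char buf[128]; };
/* log stays NULL when the context was not created through fsopen(). */
struct toy_fc { struct toy_log *log; };

static int toy_debug_enabled;		/* stand-in for the debug flag mask */
static char toy_debug_out[128];		/* stand-in for dprintk() output */

/* Format the message into the context's log buffer. */
static int toy_log_write(struct toy_log *log, const char *fmt, ...)
{
	va_list ap;
	int n;

	assert(log != NULL);
	va_start(ap, fmt);
	n = vsnprintf(log->buf, sizeof(log->buf), fmt, ap);
	va_end(ap);
	return n;
}

/* Debug-only output: silent unless debugging is switched on. */
static int toy_dprintk(const char *fmt, ...)
{
	va_list ap;
	int n;

	if (!toy_debug_enabled)
		return 0;
	va_start(ap, fmt);
	n = vsnprintf(toy_debug_out, sizeof(toy_debug_out), fmt, ap);
	va_end(ap);
	return n;
}

/* Prefer the log buffer; without one, fall back to debug-only output. */
#define toy_errorf(fc, ...) \
	((fc)->log ? toy_log_write((fc)->log, __VA_ARGS__) \
		   : toy_dprintk(__VA_ARGS__))
```

In this model a context without a log buffer produces no output unless debugging is enabled, which corresponds to the quiet behaviour the commit restores.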
     
  • commit 2c8d5fc37fe2384a9bdb6965443ab9224d46f704 upstream.

    If a layout return is in progress, we should wait for it to complete,
    in case the layout segment we are picking up gets returned too.

    Fixes: 30cb3ee299cb ("pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state seqid")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 67bbceedc9bb8ad48993a8bd6486054756d711f4 upstream.

    If the layout return-on-close failed because the layoutreturn was never
    sent, then we should mark the layout for return again.

    Fixes: 9c47b18cf722 ("pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 078000d02d57f02dde61de4901f289672e98c8bc upstream.

    If the inode is being evicted, it should be safe to run return-on-close,
    so we should do it to ensure we don't inadvertently leak layout segments.

    Fixes: 1c5bd76d17cc ("pNFS: Enable layoutreturn operation for return-on-close")
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Trond Myklebust
     
  • commit 3d1a90ab0ed93362ec8ac85cf291243c87260c21 upstream.

    It is only safe to call the tracepoint before rpc_put_task() because
    'data' is freed inside nfs4_lock_release (rpc_release).

    Fixes: 48c9579a1afe ("Adding stateid information to tracepoints")
    Signed-off-by: Dave Wysochanski
    Signed-off-by: Trond Myklebust
    Signed-off-by: Greg Kroah-Hartman

    Dave Wysochanski
     
  • [ Upstream commit 9348b73c2e1bfea74ccd4a44fb4ccc7276ab9623 ]

    Turning a pinned page read-only breaks the pinning after COW. Don't do it.

    The whole "track page soft dirty" state doesn't work with pinned pages
    anyway, since the page might be dirtied by the pinning entity without
    ever being noticed in the page tables.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Linus Torvalds
     
  • [ Upstream commit 29a951dfb3c3263c3a0f3bd9f7f2c2cfde4baedb ]

    Turning page table entries read-only requires the mmap_sem held for
    writing.

    So stop doing the odd games with turning things from read locks to write
    locks and back. Just get the write lock.

    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Linus Torvalds
     
  • [ Upstream commit ef0ba05538299f1391cbe097de36895bb36ecfe6 ]

    The kernel test robot reported a -5.8% performance regression on the
    "poll2" test of will-it-scale, and bisected it to commit d55564cfc222
    ("x86: Make __put_user() generate an out-of-line call").

    I didn't expect an out-of-line __put_user() to matter, because no normal
    core code should use that non-checking legacy version of user access any
    more. But I had overlooked the very odd poll() usage, which does a
    __put_user() to update the 'revents' values of the poll array.

    Now, Al Viro correctly points out that instead of updating just the
    'revents' field, it would be much simpler to just copy the _whole_
    pollfd entry, and then we could just use "copy_to_user()" on the whole
    array of entries, the same way we use "copy_from_user()" a few lines
    earlier to get the original values.

    But that is not what we've traditionally done, and I worry that threaded
    applications might be concurrently modifying the other fields of the
    pollfd array. So while Al's suggestion is simpler - and perhaps worth
    trying in the future - this instead keeps the "just update revents"
    model.

    To fix the performance regression, use the modern "unsafe_put_user()"
    instead of __put_user(), with the proper "user_write_access_begin()"
    guarding in place. This improves code generation enormously.

    Link: https://lore.kernel.org/lkml/20210107134723.GA28532@xsang-OptiPlex-9020/
    Reported-by: kernel test robot
    Tested-by: Oliver Sang
    Cc: Al Viro
    Cc: David Laight
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Linus Torvalds
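
The "just update revents" model can be illustrated in userspace. This sketch only models the write-back semantics: it does not use unsafe_put_user() or user_write_access_begin(), which exist only in the kernel, and toy_write_back_revents is a hypothetical name.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>

/*
 * Toy model of the write-back step. `user` stands in for the caller's
 * pollfd array and `kernel` for the kernel's private copy. Only the
 * revents field is stored back, so fd and events are left as the caller
 * (or another thread) last wrote them.
 */
static void toy_write_back_revents(struct pollfd *user,
				   const struct pollfd *kernel, size_t n)
{
	assert(user != NULL && kernel != NULL);

	for (size_t i = 0; i < n; i++)
		user[i].revents = kernel[i].revents;
}
```

Copying whole entries instead would overwrite any fd or events value that another thread changed after the kernel took its copy, which is the concern raised above about the simpler approach.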
     
  • [ Upstream commit de7f1d9e99d8b99e4e494ad8fcd91f0c4c5c9357 ]

    io_uring fds are marked O_CLOEXEC and we explicitly cancel all requests
    before going through exec, so we don't want to leave the task holding
    file references to io_uring instances that are no longer ours.

    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit cb13eea3b49055bd78e6ddf39defd6340f7379fc ]

    If we remount a filesystem in RO mode while the qgroup rescan worker is
    running, we can end up having it still running after the remount is done,
    and at unmount time we may end up with an open transaction that ends up
    never getting committed. If that happens we end up with several memory
    leaks and can crash when hardware acceleration is unavailable for crc32c.
    Possibly it can lead to other nasty surprises too, due to use-after-free
    issues.

    The following steps explain how the problem happens.

    1) We have a filesystem mounted in RW mode and the qgroup rescan worker is
    running;

    2) We remount the filesystem in RO mode, and never stop/pause the rescan
    worker, so after the remount the rescan worker is still running. The
    important detail here is that the rescan task is still running after
    the remount operation committed any ongoing transaction through its
    call to btrfs_commit_super();

    3) After the remount completes, the rescan worker, which is still
    running, finishes iterating all leaves of the extent tree and starts a
    transaction to update the qgroup status item in the quotas tree. It
    does not commit the transaction; it only releases its handle on the
    transaction;

    4) A filesystem unmount operation starts shortly after;

    5) The unmount task, at close_ctree(), stops the transaction kthread,
    which had not had a chance to commit the open transaction since it was
    sleeping and the commit interval (default of 30 seconds) has not yet
    elapsed since the last time it committed a transaction;

    6) So after stopping the transaction kthread we still have the transaction
    used to update the qgroup status item open. At close_ctree(), when the
    filesystem is in RO mode and no transaction abort happened (or the
    filesystem is in error mode), we do not expect to have any transaction
    open, so we do not call btrfs_commit_super();

    7) We then proceed to destroy the work queues, free the roots and block
    groups, etc. After that we drop the last reference on the btree inode
    by calling iput() on it. Since there are dirty pages for the btree
    inode, corresponding to the COWed extent buffer for the quotas btree,
    btree_write_cache_pages() is invoked to flush those dirty pages. This
    results in creating a bio and submitting it, which makes us end up at
    btrfs_submit_metadata_bio();

    8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
    that calls btrfs_wq_submit_bio(), because check_async_write() returned
    a value of 1. This value of 1 is because we did not have hardware
    acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
    set in fs_info->flags;

    9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
    workqueue at fs_info->workers, which was already freed before by the
    call to btrfs_stop_all_workers() at close_ctree(). This results in an
    invalid memory access due to a use-after-free, leading to a crash.

    When this happens, before the crash there are several warnings triggered,
    since we have reserved metadata space in a block group, the delayed refs
    reservation, etc:

    ------------[ cut here ]------------
    WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
    Code: f0 01 00 00 48 39 c2 75 (...)
    RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
    RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
    RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
    RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
    R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
    FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
    close_ctree+0x2ba/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f15ee221ee7
    Code: ff 0b 00 f7 d8 64 89 01 48 (...)
    RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
    RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
    R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last disabled at (0): [] 0x0
    ---[ end trace dd74718fef1ed5c6 ]---
    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
    Code: 48 83 bb b0 03 00 00 00 (...)
    RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
    RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
    RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
    R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
    FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
    close_ctree+0x2ba/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f15ee221ee7
    Code: ff 0b 00 f7 d8 64 89 01 (...)
    RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
    RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
    R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last disabled at (0): [] 0x0
    ---[ end trace dd74718fef1ed5c7 ]---
    ------------[ cut here ]------------
    WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
    Code: ad de 49 be 22 01 00 (...)
    RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
    RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
    RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
    R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
    FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    close_ctree+0x2ba/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f15ee221ee7
    Code: ff 0b 00 f7 d8 64 89 (...)
    RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
    RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
    R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
    irq event stamp: 0
    hardirqs last enabled at (0): [] 0x0
    hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
    softirqs last disabled at (0): [] 0x0
    ---[ end trace dd74718fef1ed5c8 ]---
    BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
    BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
    BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
    BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

    And the crash, which only happens when we do not have crc32c hardware
    acceleration, produces the following trace immediately after those
    warnings:

    stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
    CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
    Code: 54 55 53 48 89 f3 (...)
    RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
    RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
    RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
    R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
    FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
    btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
    submit_one_bio+0x61/0x70 [btrfs]
    btree_write_cache_pages+0x414/0x450 [btrfs]
    ? kobject_put+0x9a/0x1d0
    ? trace_hardirqs_on+0x1b/0xf0
    ? _raw_spin_unlock_irqrestore+0x3c/0x60
    ? free_debug_processing+0x1e1/0x2b0
    do_writepages+0x43/0xe0
    ? lock_acquired+0x199/0x490
    __writeback_single_inode+0x59/0x650
    writeback_single_inode+0xaf/0x120
    write_inode_now+0x94/0xd0
    iput+0x187/0x2b0
    close_ctree+0x2c6/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f3cfebabee7
    Code: ff 0b 00 f7 d8 64 89 01 (...)
    RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
    RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
    RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
    RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
    R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
    R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
    Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
    ---[ end trace dd74718fef1ed5cc ]---

    Finally, when we remove the btrfs module (rmmod btrfs), there are several
    warnings about objects that were allocated from our slabs but were never
    freed, a consequence of the transaction that was never committed and got
    leaked:

    =============================================================================
    BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    slab_err+0xb7/0xdc
    ? lock_acquired+0x199/0x490
    __kmem_cache_shutdown+0x1ac/0x3c0
    ? lock_release+0x20e/0x4c0
    kmem_cache_destroy+0x55/0x120
    btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    INFO: Object 0x0000000050cbdd61 @offset=12104
    INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
    btrfs_free_tree_block+0x128/0x360 [btrfs]
    __btrfs_cow_block+0x489/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    commit_cowonly_roots+0xfb/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    sync_filesystem+0x74/0x90
    generic_shutdown_super+0x22/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    INFO: Object 0x0000000086e9b0ff @offset=12776
    INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
    btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
    alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
    __btrfs_cow_block+0x12d/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
    commit_cowonly_roots+0x248/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    close_ctree+0x113/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    kmem_cache_destroy+0x119/0x120
    btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    =============================================================================
    BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
    CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    slab_err+0xb7/0xdc
    ? lock_acquired+0x199/0x490
    __kmem_cache_shutdown+0x1ac/0x3c0
    ? lock_release+0x20e/0x4c0
    kmem_cache_destroy+0x55/0x120
    btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    INFO: Object 0x000000001a340018 @offset=4408
    INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
    btrfs_free_tree_block+0x128/0x360 [btrfs]
    __btrfs_cow_block+0x489/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    btrfs_commit_transaction+0x60/0xc40 [btrfs]
    create_subvol+0x56a/0x990 [btrfs]
    btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
    __btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
    btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
    btrfs_ioctl+0x1a92/0x36f0 [btrfs]
    __x64_sys_ioctl+0x83/0xb0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    INFO: Object 0x000000002b46292a @offset=13648
    INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
    btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
    alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
    __btrfs_cow_block+0x12d/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    commit_cowonly_roots+0xfb/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    close_ctree+0x113/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    kmem_cache_destroy+0x119/0x120
    btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    =============================================================================
    BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
    CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    slab_err+0xb7/0xdc
    ? lock_acquired+0x199/0x490
    __kmem_cache_shutdown+0x1ac/0x3c0
    ? __mutex_unlock_slowpath+0x45/0x2a0
    kmem_cache_destroy+0x55/0x120
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 f5 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    INFO: Object 0x000000004cf95ea8 @offset=6264
    INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
    __slab_alloc.isra.0+0x109/0x1c0
    kmem_cache_alloc+0x7bb/0x830
    btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
    alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
    __btrfs_cow_block+0x12d/0x5f0 [btrfs]
    btrfs_cow_block+0xf7/0x220 [btrfs]
    btrfs_search_slot+0x62a/0xc40 [btrfs]
    btrfs_del_orphan_item+0x65/0xd0 [btrfs]
    btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
    open_ctree+0x125a/0x18a0 [btrfs]
    btrfs_mount_root.cold+0x13/0xed [btrfs]
    legacy_get_tree+0x30/0x60
    vfs_get_tree+0x28/0xe0
    fc_mount+0xe/0x40
    vfs_kern_mount.part.0+0x71/0x90
    btrfs_mount+0x13b/0x3e0 [btrfs]
    INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
    kmem_cache_free+0x34c/0x3c0
    __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
    btrfs_run_delayed_refs+0x81/0x210 [btrfs]
    commit_cowonly_roots+0xfb/0x300 [btrfs]
    btrfs_commit_transaction+0x367/0xc40 [btrfs]
    close_ctree+0x113/0x2fa [btrfs]
    generic_shutdown_super+0x6c/0x100
    kill_anon_super+0x14/0x30
    btrfs_kill_super+0x12/0x20 [btrfs]
    deactivate_locked_super+0x31/0x70
    cleanup_mnt+0x100/0x160
    task_work_run+0x68/0xb0
    exit_to_user_mode_prepare+0x1bb/0x1c0
    syscall_exit_to_user_mode+0x4b/0x260
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
    CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x8d/0xb5
    kmem_cache_destroy+0x119/0x120
    exit_btrfs_fs+0xa/0x59 [btrfs]
    __x64_sys_delete_module+0x194/0x260
    ? fpregs_assert_state_consistent+0x1e/0x40
    ? exit_to_user_mode_prepare+0x55/0x1c0
    ? trace_hardirqs_on+0x1b/0xf0
    do_syscall_64+0x33/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f693e305897
    Code: 73 01 c3 48 8b 0d f9 (...)
    RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
    RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
    RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
    RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
    R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
    R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
    BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

    Fix this issue by having the remount path stop the qgroup rescan worker
    when we are remounting RO and teach the rescan worker to stop when a
    remount is in progress. If later a remount in RW mode happens, we are
    already resuming the qgroup rescan worker through the call to
    btrfs_qgroup_rescan_resume(), so we do not need to worry about that.

    Tested-by: Fabian Vogt
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit 8fc058597a283e9a37720abb0e8d68e342b9387d ]

    btrfs_discard_workfn() drops discard_ctl->lock just to take it again in
    a moment in btrfs_discard_schedule_work(). Avoid that and also reuse
    ktime.

    Reviewed-by: Josef Bacik
    Signed-off-by: Pavel Begunkov
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit ea9ed87c73e87e044b2c58d658eb4ba5216bc488 ]

    It might happen that bg->discard_eligible_time was changed without
    rescheduling, so btrfs_discard_workfn() wakes up earlier than that new
    time, peek_discard_list() returns NULL, and all work halts and goes to
    sleep without further rescheduling even though there are block groups to
    discard.

    It happens pretty often, but it is not very visible from userspace
    because after some time the work is usually kicked off anyway by someone
    else calling btrfs_discard_reschedule_work().

    Fix it by continuing to reschedule if the block group discard lists are
    not empty.

    Reviewed-by: Josef Bacik
    Signed-off-by: Pavel Begunkov
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit d434ab6db524ab1efd0afad4ffa1ee65ca6ac097 ]

    __io_req_task_submit() run by task_work can set mm and files, but
    io_sq_thread() in some cases, and because __io_sq_thread_acquire_mm()
    and __io_sq_thread_acquire_files() do a simple current->mm/files check
    it may end up submitting IO with mm/files of another task.

    We also need to drop them at the end, to release any references to them
    that may have been grabbed.

    Cc: stable@vger.kernel.org # 5.9+
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit 621fadc22365f3cf307bcd9048e3372e9ee9cdcc ]

    In rare cases a task may be exiting while io_ring_exit_work() is trying
    to cancel/wait for its requests. That is OK for
    __io_sq_thread_acquire_mm() because of the SQPOLL check, but not for
    __io_sq_thread_acquire_files(). Play it safe and fail for both of them.

    Cc: stable@vger.kernel.org # 5.5+
    Signed-off-by: Pavel Begunkov
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Pavel Begunkov
     
  • [ Upstream commit 5a3b590d4b2db187faa6f06adc9a53d6199fb1f9 ]

    When the first file is opened, ext4 samples the mountpoint of the
    filesystem in 64 bytes of the super block. It does so using
    strlcpy(), which means that the remaining bytes in the super block
    string buffer are untouched. If the previous mount point had a longer
    path than the current one, it can be reconstructed.

    Consider the case where the fs was mounted to "/media/johnjdeveloper"
    and later to "/". The super block buffer then contains
    "/\x00edia/johnjdeveloper".

    This case was seen in the wild and caused confusion about how the name
    of a developer ends up on the super block of a filesystem used
    in production...

    Fix this by using strncpy() instead of strlcpy(). The superblock
    field is defined to be a fixed-size char array, and it is already
    marked using __nonstring in fs/ext4/ext4.h. The consumer of the field
    in e2fsprogs already assumes that in the case of a 64+ byte mount
    path, that s_last_mounted will not be NUL terminated.
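
    The leak described above can be reproduced in userspace. The sketch
    below is only an illustration, not the ext4 code: copy_like_strlcpy()
    is a hypothetical stand-in that mimics strlcpy() semantics,
    has_stale_tail() is a made-up checker, and LAST_MOUNTED_LEN just
    mirrors the 64-byte field size mentioned in the text.

```c
#include <stddef.h>
#include <string.h>

#define LAST_MOUNTED_LEN 64  /* size of the fixed field, per the text above */

/*
 * Hypothetical stand-in for strlcpy(): copies at most size-1 bytes, adds one
 * NUL, and leaves every byte after that NUL untouched.
 */
static size_t copy_like_strlcpy(char *dst, const char *src, size_t size)
{
    size_t len = strlen(src);

    if (size > 0) {
        size_t n = len < size - 1 ? len : size - 1;

        memcpy(dst, src, n);
        dst[n] = '\0';
    }
    return len;
}

/* Returns 1 if any byte after the first NUL in the field is non-zero. */
static int has_stale_tail(const char *field, size_t size)
{
    size_t i = 0;

    while (i < size && field[i] != '\0')
        i++;
    for (; i < size; i++)
        if (field[i] != '\0')
            return 1;
    return 0;
}
```

    Copying "/media/johnjdeveloper" and then "/" into the same 64-byte
    field with the strlcpy-like helper leaves the old tail behind, while
    strncpy(field, "/", sizeof(field)) pads the rest of the field with NUL
    bytes and so erases it.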

    Link: https://lore.kernel.org/r/X9ujIOJG/HqMr88R@mit.edu
    Reported-by: Richard Weinberger
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Sasha Levin

    Theodore Ts'o
     
  • [ Upstream commit 347fb0cfc9bab5195c6701e62eda488310d7938f ]

    While mounting a crafted image provided by a user, the kernel panics
    due to an invalid chunk item whose end is less than its start.

    [66.387422] loop: module loaded
    [66.389773] loop0: detected capacity change from 262144 to 0
    [66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613)
    [66.431061] BTRFS info (device loop0): disk space caching is enabled
    [66.431078] BTRFS info (device loop0): has skinny extents
    [66.437101] BTRFS error: insert state: end < start 29360127 37748736
    [66.437136] ------------[ cut here ]------------
    [66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs]
    [66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G O 5.11.0-rc1-custom #45
    [66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
    [66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs]
    [66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286
    [66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
    [66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
    [66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001
    [66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0
    [66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000
    [66.437447] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
    [66.437451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
    [66.437460] PKRU: 55555554
    [66.437464] Call Trace:
    [66.437475] set_extent_bit+0x652/0x740 [btrfs]
    [66.437539] set_extent_bits_nowait+0x1d/0x20 [btrfs]
    [66.437576] add_extent_mapping+0x1e0/0x2f0 [btrfs]
    [66.437621] read_one_chunk+0x33c/0x420 [btrfs]
    [66.437674] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
    [66.437708] ? kvm_sched_clock_read+0x18/0x40
    [66.437739] open_ctree+0xb32/0x1734 [btrfs]
    [66.437781] ? bdi_register_va+0x1b/0x20
    [66.437788] ? super_setup_bdi_name+0x79/0xd0
    [66.437810] btrfs_mount_root.cold+0x12/0xeb [btrfs]
    [66.437854] ? __kmalloc_track_caller+0x217/0x3b0
    [66.437873] legacy_get_tree+0x34/0x60
    [66.437880] vfs_get_tree+0x2d/0xc0
    [66.437888] vfs_kern_mount.part.0+0x78/0xc0
    [66.437897] vfs_kern_mount+0x13/0x20
    [66.437902] btrfs_mount+0x11f/0x3c0 [btrfs]
    [66.437940] ? kfree+0x5ff/0x670
    [66.437944] ? __kmalloc_track_caller+0x217/0x3b0
    [66.437962] legacy_get_tree+0x34/0x60
    [66.437974] vfs_get_tree+0x2d/0xc0
    [66.437983] path_mount+0x48c/0xd30
    [66.437998] __x64_sys_mount+0x108/0x140
    [66.438011] do_syscall_64+0x38/0x50
    [66.438018] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [66.438023] RIP: 0033:0x7f0138827f6e
    [66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
    [66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
    [66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
    [66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
    [66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
    [66.438078] irq event stamp: 18169
    [66.438082] hardirqs last enabled at (18175): [] console_unlock+0x4ff/0x5f0
    [66.438088] hardirqs last disabled at (18180): [] console_unlock+0x467/0x5f0
    [66.438092] softirqs last enabled at (16910): [] asm_call_irq_on_stack+0x12/0x20
    [66.438097] softirqs last disabled at (16905): [] asm_call_irq_on_stack+0x12/0x20
    [66.438103] ---[ end trace e114b111db64298b ]---
    [66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127
    [66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists)
    [66.441069] ------------[ cut here ]------------
    [66.441072] kernel BUG at fs/btrfs/extent_io.c:679!
    [66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
    [66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G W O 5.11.0-rc1-custom #45
    [66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
    [66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
    [66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
    [66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
    [66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
    [66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
    [66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
    [66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
    [66.458356] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
    [66.459841] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
    [66.462196] PKRU: 55555554
    [66.462692] Call Trace:
    [66.463139] set_extent_bit.cold+0x30/0x98 [btrfs]
    [66.464049] set_extent_bits_nowait+0x1d/0x20 [btrfs]
    [66.490466] add_extent_mapping+0x1e0/0x2f0 [btrfs]
    [66.514097] read_one_chunk+0x33c/0x420 [btrfs]
    [66.534976] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
    [66.555718] ? kvm_sched_clock_read+0x18/0x40
    [66.575758] open_ctree+0xb32/0x1734 [btrfs]
    [66.595272] ? bdi_register_va+0x1b/0x20
    [66.614638] ? super_setup_bdi_name+0x79/0xd0
    [66.633809] btrfs_mount_root.cold+0x12/0xeb [btrfs]
    [66.652938] ? __kmalloc_track_caller+0x217/0x3b0
    [66.671925] legacy_get_tree+0x34/0x60
    [66.690300] vfs_get_tree+0x2d/0xc0
    [66.708221] vfs_kern_mount.part.0+0x78/0xc0
    [66.725808] vfs_kern_mount+0x13/0x20
    [66.742730] btrfs_mount+0x11f/0x3c0 [btrfs]
    [66.759350] ? kfree+0x5ff/0x670
    [66.775441] ? __kmalloc_track_caller+0x217/0x3b0
    [66.791750] legacy_get_tree+0x34/0x60
    [66.807494] vfs_get_tree+0x2d/0xc0
    [66.823349] path_mount+0x48c/0xd30
    [66.838753] __x64_sys_mount+0x108/0x140
    [66.854412] do_syscall_64+0x38/0x50
    [66.869673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [66.885093] RIP: 0033:0x7f0138827f6e
    [66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
    [66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
    [66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
    [67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
    [67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
    [67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
    [67.216138] ---[ end trace e114b111db64298c ]---
    [67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
    [67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
    [67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
    [67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
    [67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
    [67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
    [67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
    [67.465436] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
    [67.511660] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
    [67.558449] PKRU: 55555554
    [67.581146] note: mount[613] exited with preempt_count 2

    The image has a chunk item which has a logical start 37748736 and length
    18446744073701163008 (-8M). The calculated end 29360127 overflows.
    EEXIST was caught by insert_state() because of the duplicate end and
    extent_io_tree_panic() was called.

    Add overflow check of chunk item end to tree checker so it can be
    detected early at mount time.
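
    The arithmetic behind the wrapped end value can be checked with a
    short sketch. This is not the tree-checker code from the patch;
    chunk_end_overflows() and chunk_end() are hypothetical helpers that
    only show how an unsigned 64-bit start plus length can wrap, using
    the numbers quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the kernel's tree-checker code: reports whether
 * the byte range [start, start + length) wraps around the 64-bit space.
 */
static bool chunk_end_overflows(uint64_t start, uint64_t length)
{
    /* Unsigned addition wraps modulo 2^64, so a wrapped sum is below start. */
    return start + length < start;
}

/* Inclusive end of the range, computed with the same wrapping arithmetic. */
static uint64_t chunk_end(uint64_t start, uint64_t length)
{
    return start + length - 1;
}
```

    With start 37748736 and length 18446744073701163008, chunk_end()
    yields 29360127, which is below the start, and chunk_end_overflows()
    reports the wrap.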

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Anand Jain
    Signed-off-by: Su Yue
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Su Yue
     
  • commit 2659d3bff3e1b000f49907d0839178b101a89887 upstream.

    Retry close command if it gets interrupted to not leak open handles on
    the server.

    Signed-off-by: Paulo Alcantara (SUSE)
    Reported-by: Duncan Findlay
    Suggested-by: Pavel Shilovsky
    Fixes: 6988a619f5b7 ("cifs: allow syscalls to be restarted in __smb_send_rqst()")
    Cc: stable@vger.kernel.org
    Reviewed-by: Pavel Shilovsky
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Paulo Alcantara
     
  • commit 77b6ec01c29aade01701aa30bf1469acc7f2be76 upstream.

    clang static analysis reports this problem:

    dfs_cache.c:591:2: warning: Argument to kfree() is a constant address
    (18446744073709551614), which is not memory allocated by malloc()
    kfree(vi);
    ^~~~~~~~~

    In dfs_cache_del_vol() the volume info pointer 'vi' being freed
    is the return of a call to find_vol(). The large constant address
    is find_vol() returning an error.

    Add an error check to dfs_cache_del_vol() similar to the one done
    in dfs_cache_update_vol().
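
    The constant in the warning is what -2 looks like when read as an
    unsigned 64-bit address, which is how an errno encoded in a pointer
    appears. The sketch below imitates that convention in userspace;
    err_ptr(), is_err() and ptr_err() are simplified stand-ins for the
    kernel helpers, and find_item()/delete_item() are hypothetical names,
    not the cifs functions.

```c
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ERRNO 4095

/* Userspace imitations of the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() idea. */
static void *err_ptr(long error)
{
    return (void *)(intptr_t)error;
}

static int is_err(const void *p)
{
    /* Encoded errors occupy the top MAX_ERRNO values of the address space. */
    return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

static long ptr_err(const void *p)
{
    return (long)(intptr_t)p;
}

/* Hypothetical lookup that fails by returning an encoded errno. */
static void *find_item(int present)
{
    return present ? malloc(16) : err_ptr(-ENOENT);
}

/* Free the item only when the lookup really returned allocated memory. */
static long delete_item(int present)
{
    void *item = find_item(present);

    if (is_err(item))
        return ptr_err(item);
    free(item);
    return 0;
}
```

    Without the is_err() check, delete_item(0) would pass the encoded
    error straight to free(), which is the userspace analogue of the
    kfree() call flagged by the analyzer.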

    Fixes: 54be1f6c1c37 ("cifs: Add DFS cache routines")
    Signed-off-by: Tom Rix
    Reviewed-by: Nathan Chancellor
    CC: # v5.0+
    Signed-off-by: Steve French
    Signed-off-by: Greg Kroah-Hartman

    Tom Rix
     
  • commit 6b4b8e6b4ad8553660421d6360678b3811d5deb9 upstream.

    We got a "deleted inode referenced" warning during our fsstress test. The
    bug can be reproduced easily with the following steps:

    cd /dev/shm
    mkdir test/
    fallocate -l 128M img
    mkfs.ext4 -b 1024 img
    mount img test/
    dd if=/dev/zero of=test/foo bs=1M count=128
    mkdir test/dir/ && cd test/dir/
    for ((i=0;i
    Signed-off-by: Greg Kroah-Hartman

    yangerkun
     
  • commit 31e203e09f036f48e7c567c2d32df0196bbd303f upstream.

    After a full/fast commit, entries in the staging queue are promoted to the
    main queue. In the ext4_fc_cleanup function, however, the staging queue is
    spliced onto the staging queue itself instead of onto the main queue.

    Fixes: aa75f4d3daaeb ("ext4: main fast-commit commit path")
    Signed-off-by: Daejun Park
    Reviewed-by: Harshad Shirwadkar
    Link: https://lore.kernel.org/r/20201230094851epcms2p6eeead8cc984379b37b2efd21af90fd1a@epcms2p6
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Daejun Park
     
  • commit 23dd561ad9eae02b4d51bb502fe4e1a0666e9567 upstream.

    1: ext4_iget() and ext4_find_extent() never return NULL, so use IS_ERR
    instead of IS_ERR_OR_NULL to check their results.

    2: ext4_fc_replay_inode() should set the inode to NULL when IS_ERR is
    true, so that the exit path can call iput() safely.

    Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
    Signed-off-by: Yi Li
    Reviewed-by: Jan Kara
    Link: https://lore.kernel.org/r/20201230033827.3996064-1-yili@winhong.com
    Signed-off-by: Theodore Ts'o
    Cc: stable@kernel.org
    Signed-off-by: Greg Kroah-Hartman

    Yi Li
     
  • commit 29b665cc51e8b602bf2a275734349494776e3dbc upstream.

    Some extent io trees are initialized with NULL private member (e.g.
    btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
    Dereferencing a NULL tree->private as an inode pointer causes a panic.

    Pass tree->fs_info as it's known to be valid in all cases.

    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
    Fixes: 05912a3c04eb ("btrfs: drop extent_io_ops::tree_fs_info callback")
    CC: stable@vger.kernel.org # 4.19+
    Reviewed-by: Anand Jain
    Signed-off-by: Su Yue
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Su Yue
     
  • commit 50e31ef486afe60f128d42fb9620e2a63172c15c upstream.

    [BUG]
    There are several bug reports about recent kernels being unable to
    relocate certain data block groups.

    Sometimes the error just goes away, but there is one reporter who can
    reproduce it reliably.

    The dmesg would look like:

    [438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
    [438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
    [450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
    [463.501781] BTRFS info (device dm-10): balance: ended with status: -2

    [CAUSE]
    The ENOENT error is returned from the following call chain:

    add_data_references()
    |- delete_v1_space_cache();
       |- if (!found)
              return -ENOENT;

    The variable @found is set to true if we find a data extent whose
    disk bytenr matches parameter @data_bytenr.

    With extra debugging, the offending tree block looks like this:

    leaf bytenr = 42676709441536, data_bytenr = 34626327621632

    ctime 1567904822.739884119 (2019-09-08 03:07:02)
    mtime 0.0 (1970-01-01 01:00:00)
    otime 0.0 (1970-01-01 01:00:00)
    item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
    generation 1517381 type 2 (prealloc)
    prealloc data disk byte 34626327621632 nr 262144 <<<
    prealloc data offset 0 nr 262144
    item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
    generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
    lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
    uuid d0d4361f-d231-6d40-8901-fe506e4b2b53

    Although item 27 has disk bytenr 34626327621632, which matches the
    data_bytenr, its type is prealloc, not reg.
    This makes the existing code skip that item, and return ENOENT.

    [FIX]
    The code was modified in commit 19b546d7a1b2 ("btrfs: relocation: Use
    btrfs_find_all_leafs to locate data extent parent tree leaves"). Before
    that commit, we used something like

    "if (type == BTRFS_FILE_EXTENT_INLINE) continue;"

    But in that offending commit, we use (type == BTRFS_FILE_EXTENT_REG),
    ignoring BTRFS_FILE_EXTENT_PREALLOC.

    Fix it by also checking BTRFS_FILE_EXTENT_PREALLOC.
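
    A minimal sketch of the two forms of the check, assuming the numeric
    extent type values 0 (inline), 1 (regular) and 2 (prealloc, as in the dump
    above). The enum and function names are illustrative, not the ones used in
    relocation.c.

    ```c
    #include <stdbool.h>

    /* Illustrative copies of the file extent type values. */
    enum file_extent_type {
        FILE_EXTENT_INLINE = 0,
        FILE_EXTENT_REG = 1,
        FILE_EXTENT_PREALLOC = 2,
    };

    /* Regressed form: only regular extents are considered for a match. */
    bool considered_reg_only(int type)
    {
        return type == FILE_EXTENT_REG;
    }

    /* Fixed form: preallocated extents also point at disk bytes, so accept both. */
    bool considered_reg_or_prealloc(int type)
    {
        return type == FILE_EXTENT_REG || type == FILE_EXTENT_PREALLOC;
    }
    ```

    Item 27 above has type 2, so the first form skips it and the lookup ends
    with ENOENT, while the second form accepts it. Inline extents are rejected
    by both.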

    Reported-by: Stéphane Lesimple
    Link: https://lore.kernel.org/linux-btrfs/505cabfa88575ed6dbe7cb922d8914fb@lesimple.fr
    Fixes: 19b546d7a1b2 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
    CC: stable@vger.kernel.org # 5.6+
    Tested-By: Stéphane Lesimple
    Reviewed-by: Su Yue
    Signed-off-by: Qu Wenruo
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Greg Kroah-Hartman

    Qu Wenruo
     

19 Jan, 2021

1 commit

  • Changes in 5.10.8
    powerpc/32s: Fix RTAS machine check with VMAP stack
    io_uring: synchronise IOPOLL on task_submit fail
    io_uring: limit {io|sq}poll submit locking scope
    io_uring: patch up IOPOLL overflow_flush sync
    RDMA/hns: Avoid filling sl in high 3 bits of vlan_id
    iommu/arm-smmu-qcom: Initialize SCTLR of the bypass context
    drm/panfrost: Don't corrupt the queue mutex on open/close
    io_uring: Fix return value from alloc_fixed_file_ref_node
    scsi: ufs: Fix -Wsometimes-uninitialized warning
    btrfs: skip unnecessary searches for xattrs when logging an inode
    btrfs: fix deadlock when cloning inline extent and low on free metadata space
    btrfs: shrink delalloc pages instead of full inodes
    net: cdc_ncm: correct overhead in delayed_ndp_size
    net: hns3: fix incorrect handling of sctp6 rss tuple
    net: hns3: fix the number of queues actually used by ARQ
    net: hns3: fix a phy loopback fail issue
    net: stmmac: dwmac-sun8i: Fix probe error handling
    net: stmmac: dwmac-sun8i: Balance internal PHY resource references
    net: stmmac: dwmac-sun8i: Balance internal PHY power
    net: stmmac: dwmac-sun8i: Balance syscon (de)initialization
    net: vlan: avoid leaks on register_vlan_dev() failures
    net/sonic: Fix some resource leaks in error handling paths
    net: bareudp: add missing error handling for bareudp_link_config()
    ptp: ptp_ines: prevent build when HAS_IOMEM is not set
    net: ipv6: fib: flush exceptions when purging route
    tools: selftests: add test for changing routes with PTMU exceptions
    net: fix pmtu check in nopmtudisc mode
    net: ip: always refragment ip defragmented packets
    chtls: Fix hardware tid leak
    chtls: Remove invalid set_tcb call
    chtls: Fix panic when route to peer not configured
    chtls: Avoid unnecessary freeing of oreq pointer
    chtls: Replace skb_dequeue with skb_peek
    chtls: Added a check to avoid NULL pointer dereference
    chtls: Fix chtls resources release sequence
    octeontx2-af: fix memory leak of lmac and lmac->name
    nexthop: Fix off-by-one error in error path
    nexthop: Unlink nexthop group entry in error path
    nexthop: Bounce NHA_GATEWAY in FDB nexthop groups
    s390/qeth: fix deadlock during recovery
    s390/qeth: fix locking for discipline setup / removal
    s390/qeth: fix L2 header access in qeth_l3_osa_features_check()
    net: dsa: lantiq_gswip: Exclude RMII from modes that report 1 GbE
    net/mlx5: Use port_num 1 instead of 0 when delete a RoCE address
    net/mlx5e: ethtool, Fix restriction of autoneg with 56G
    net/mlx5e: In skb build skip setting mark in switchdev mode
    net/mlx5: Check if lag is supported before creating one
    scsi: lpfc: Fix variable 'vport' set but not used in lpfc_sli4_abts_err_handler()
    ionic: start queues before announcing link up
    HID: wacom: Fix memory leakage caused by kfifo_alloc
    fanotify: Fix sys_fanotify_mark() on native x86-32
    ARM: OMAP2+: omap_device: fix idling of devices during probe
    i2c: sprd: use a specific timeout to avoid system hang up issue
    dmaengine: dw-edma: Fix use after free in dw_edma_alloc_chunk()
    selftests/bpf: Clarify build error if no vmlinux
    can: tcan4x5x: fix bittiming const, use common bittiming from m_can driver
    can: m_can: m_can_class_unregister(): remove erroneous m_can_clk_stop()
    can: kvaser_pciefd: select CONFIG_CRC32
    spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending
    cpufreq: powernow-k8: pass policy rather than use cpufreq_cpu_get()
    spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
    spi: stm32: FIFO threshold level - fix align packet size
    i2c: i801: Fix the i2c-mux gpiod_lookup_table not being properly terminated
    i2c: mediatek: Fix apdma and i2c hand-shake timeout
    bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET
    interconnect: imx: Add a missing of_node_put after of_device_is_available
    interconnect: qcom: fix rpmh link failures
    dmaengine: mediatek: mtk-hsdma: Fix a resource leak in the error handling path of the probe function
    dmaengine: milbeaut-xdmac: Fix a resource leak in the error handling path of the probe function
    dmaengine: xilinx_dma: check dma_async_device_register return value
    dmaengine: xilinx_dma: fix incompatible param warning in _child_probe()
    dmaengine: xilinx_dma: fix mixed_enum_type coverity warning
    arm64: mm: Fix ARCH_LOW_ADDRESS_LIMIT when !CONFIG_ZONE_DMA
    qed: select CONFIG_CRC32
    phy: dp83640: select CONFIG_CRC32
    wil6210: select CONFIG_CRC32
    block: rsxx: select CONFIG_CRC32
    lightnvm: select CONFIG_CRC32
    zonefs: select CONFIG_CRC32
    iommu/vt-d: Fix misuse of ALIGN in qi_flush_piotlb()
    iommu/intel: Fix memleak in intel_irq_remapping_alloc
    bpftool: Fix compilation failure for net.o with older glibc
    nvme-tcp: Fix possible race of io_work and direct send
    net/mlx5e: Fix memleak in mlx5e_create_l2_table_groups
    net/mlx5e: Fix two double free cases
    regmap: debugfs: Fix a memory leak when calling regmap_attach_dev
    wan: ds26522: select CONFIG_BITREVERSE
    arm64: cpufeature: remove non-exist CONFIG_KVM_ARM_HOST
    regulator: qcom-rpmh-regulator: correct hfsmps515 definition
    net: mvpp2: disable force link UP during port init procedure
    drm/i915/dp: Track pm_qos per connector
    net: mvneta: fix error message when MTU too large for XDP
    selftests: fib_nexthops: Fix wrong mausezahn invocation
    KVM: arm64: Don't access PMCR_EL0 when no PMU is available
    xsk: Fix race in SKB mode transmit with shared cq
    xsk: Rollback reservation at NETDEV_TX_BUSY
    block/rnbd-clt: avoid module unload race with close confirmation
    can: isotp: isotp_getname(): fix kernel information leak
    block: fix use-after-free in disk_part_iter_next
    net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
    regmap: debugfs: Fix a reversed if statement in regmap_debugfs_init()
    drm/panfrost: Remove unused variables in panfrost_job_close()
    tools headers UAPI: Sync linux/fscrypt.h with the kernel sources
    Linux 5.10.8

    Signed-off-by: Greg Kroah-Hartman
    Change-Id: Ib8272ec9f47a3c3813509bcacece3b16137332e1

    Greg Kroah-Hartman
     

17 Jan, 2021

5 commits

  • commit 4f8b848788f77c7f5c3bd98febce66b7aa14785f upstream.

    When CRC32 is disabled, zonefs cannot be linked:

    ld: fs/zonefs/super.o: in function `zonefs_fill_super':

    Add a Kconfig 'select' statement for it.

    Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Damien Le Moal
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     
  • commit 2ca408d9c749c32288bc28725f9f12ba30299e8f upstream.

    Commit

    121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")

    converted native x86-32 syscalls which take 64-bit arguments to use the
    compat handlers to allow conversion to passing args via pt_regs.
    sys_fanotify_mark() was however missed, as it has a general compat
    handler. Add a config option that will use the syscall wrapper that
    takes the split args for native 32-bit.

    [ bp: Fix typo in Kconfig help text. ]

    Fixes: 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")
    Reported-by: Paweł Jasiak
    Signed-off-by: Brian Gerst
    Signed-off-by: Borislav Petkov
    Acked-by: Jan Kara
    Acked-by: Andy Lutomirski
    Link: https://lkml.kernel.org/r/20201130223059.101286-1-brgerst@gmail.com
    Signed-off-by: Greg Kroah-Hartman

    Brian Gerst
     
  • [ Upstream commit e076ab2a2ca70a0270232067cd49f76cd92efe64 ]

    Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
    shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
    some infrastructure we have in place to flush inodes that we use for
    device replace and snapshot. However this introduced a pretty serious
    performance regression. To reproduce, the user untarred the source
    tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
    see it take anywhere from 5 to 20 times as long to untar in 5.10
    compared to 5.9. This was observed on fast devices (SSD and better) and
    not on HDD.

    The root cause is that previously we would generally use the normal
    writeback path to reclaim delalloc space, and for this we would provide
    it with the number of pages we wanted to flush. The referenced commit
    changed this to flush that many inodes, which drastically increased the
    amount of space we were flushing in certain cases, which severely
    affected performance.
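
    A toy model of the difference (not the btrfs code; the function names and
    the per-inode dirty page counts are made up): the same budget value writes
    far more data when it counts whole inodes than when it counts pages.

    ```c
    #include <stddef.h>

    /* Budget counts inodes: each selected inode is flushed in full. */
    unsigned long flush_by_inode_count(const unsigned long *dirty_pages,
                                       size_t n_inodes, unsigned long budget)
    {
        unsigned long written = 0;

        for (size_t i = 0; i < n_inodes && i < budget; i++)
            written += dirty_pages[i];
        return written;
    }

    /* Budget counts pages: stop as soon as that many pages have been written. */
    unsigned long flush_by_page_count(const unsigned long *dirty_pages,
                                      size_t n_inodes, unsigned long budget)
    {
        unsigned long written = 0;

        for (size_t i = 0; i < n_inodes && written < budget; i++) {
            unsigned long left = budget - written;

            written += dirty_pages[i] < left ? dirty_pages[i] : left;
        }
        return written;
    }
    ```

    For four inodes with 1000, 2000, 3000 and 4000 dirty pages and a budget of
    3, the inode-based variant writes 6000 pages while the page-based variant
    writes 3.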

    We cannot revert this patch unfortunately because of 3d45f221ce62
    ("btrfs: fix deadlock when cloning inline extent and low on free
    metadata space") which requires the ability to skip flushing inodes that
    are being cloned in certain scenarios, which means we need to keep using
    our flushing infrastructure or risk re-introducing the deadlock.

    Instead, to fix this problem, we can go back to providing
    btrfs_start_delalloc_roots with a number of pages to flush, and then set
    up a writeback_control and utilize sync_inode() to handle the flushing
    for us. This gives us the same behavior we had prior to the fix, while
    still allowing us to avoid the deadlock that was fixed by Filipe. I
    redid the user's original test and got the following results on one of
    our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive):

    5.9               0m54.258s
    5.10              1m26.212s
    5.10+patch        0m38.800s

    5.10+patch is significantly faster than plain 5.9 because of my patch
    series "Change data reservations to use the ticketing infra" which
    contained the patch that introduced the regression, but generally
    improved the overall ENOSPC flushing mechanisms.

    Additional testing on a consumer-grade SSD (8GiB ram, 8 CPU) confirms
    the results:

    5.10.5            4m00s
    5.10.5+patch      1m08s
    5.11-rc2          5m14s
    5.11-rc2+patch    1m30s

    Reported-by: René Rebe
    Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
    CC: stable@vger.kernel.org # 5.10
    Signed-off-by: Josef Bacik
    Tested-by: David Sterba
    Reviewed-by: David Sterba
    [ add my test results ]
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Josef Bacik
     
  • [ Upstream commit 3d45f221ce627d13e2e6ef3274f06750c84a6542 ]

    When cloning an inline extent there are cases where we cannot just copy
    the inline extent from the source range to the target range (e.g. when the
    target range starts at an offset greater than zero). In such cases we copy
    the inline extent's data into a page of the destination inode and then
    dirty that page. However, after that we will need to start a transaction
    for each processed extent and, if we are ever low on available metadata
    space, we may need to flush existing delalloc for all dirty inodes in an
    attempt to release metadata space - if that happens we may deadlock:

    * the async reclaim task queued a delalloc work to flush delalloc for
    the destination inode of the clone operation;

    * the task executing that delalloc work gets blocked waiting for the
    range with the dirty page to be unlocked, which is currently locked
    by the task doing the clone operation;

    * the async reclaim task blocks waiting for the delalloc work to complete;

    * the cloning task is waiting on the waitqueue of its reservation ticket
    while holding the range with the dirty page locked in the inode's
    io_tree;

    * if metadata space is not released by some other task (like delalloc for
    some other inode completing for example), the clone task waits forever
    and as a consequence the delalloc work and async reclaim tasks will hang
    forever as well. Releasing more space on the other hand may require
    starting a transaction, which will hang as well when trying to reserve
    metadata space, resulting in a deadlock between all these tasks.

    When this happens, traces like the following show up in dmesg/syslog:

    [87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
    [87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    [87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
    [87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
    [87452.326136] Call Trace:
    [87452.326737] __schedule+0x5d1/0xcf0
    [87452.327390] schedule+0x45/0xe0
    [87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
    [87452.328894] ? finish_wait+0x90/0x90
    [87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
    [87452.330133] ? __mod_memcg_state+0x8e/0x160
    [87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
    [87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
    [87452.332007] ? lock_release+0x20e/0x4c0
    [87452.332557] ? trace_hardirqs_on+0x1b/0xf0
    [87452.333127] extent_writepages+0x43/0x90 [btrfs]
    [87452.333653] ? lock_acquire+0x1a3/0x490
    [87452.334177] do_writepages+0x43/0xe0
    [87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
    [87452.335720] __filemap_fdatawrite_range+0xc5/0x100
    [87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
    [87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
    [87452.337838] process_one_work+0x24e/0x5e0
    [87452.338437] worker_thread+0x50/0x3b0
    [87452.339137] ? process_one_work+0x5e0/0x5e0
    [87452.339884] kthread+0x153/0x170
    [87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
    [87452.341153] ret_from_fork+0x22/0x30
    [87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
    [87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
    [87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    [87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
    [87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
    [87452.345655] Call Trace:
    [87452.346305] __schedule+0x5d1/0xcf0
    [87452.346947] ? kvm_clock_read+0x14/0x30
    [87452.347676] ? wait_for_completion+0x81/0x110
    [87452.348389] schedule+0x45/0xe0
    [87452.349077] schedule_timeout+0x30c/0x580
    [87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [87452.350340] ? lock_acquire+0x1a3/0x490
    [87452.351006] ? try_to_wake_up+0x7a/0xa20
    [87452.351541] ? lock_release+0x20e/0x4c0
    [87452.352040] ? lock_acquired+0x199/0x490
    [87452.352517] ? wait_for_completion+0x81/0x110
    [87452.353000] wait_for_completion+0xab/0x110
    [87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
    [87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
    [87452.354455] flush_space+0x24f/0x660 [btrfs]
    [87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
    [87452.355565] process_one_work+0x24e/0x5e0
    [87452.356024] worker_thread+0x20f/0x3b0
    [87452.356487] ? process_one_work+0x5e0/0x5e0
    [87452.356973] kthread+0x153/0x170
    [87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
    [87452.357880] ret_from_fork+0x22/0x30
    (...)
    < stack traces of several tasks waiting for the locks of the inodes of the
    clone operation >
    (...)
    [92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
    [92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
    [92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
    [92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
    [92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
    [92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
    [92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
    [92867.447920] Call Trace:
    [92867.448435] __schedule+0x5d1/0xcf0
    [92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
    [92867.449423] schedule+0x45/0xe0
    [92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
    [92867.450576] ? finish_wait+0x90/0x90
    [92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
    [92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
    [92867.452412] start_transaction+0x2d1/0x760 [btrfs]
    [92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
    [92867.453848] ? lock_release+0x20e/0x4c0
    [92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
    [92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
    [92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
    [92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
    [92867.457213] do_clone_file_range+0xd4/0x1f0
    [92867.457828] vfs_clone_file_range+0x4d/0x230
    [92867.458355] ? lock_release+0x20e/0x4c0
    [92867.458890] ioctl_file_clone+0x8f/0xc0
    [92867.459377] do_vfs_ioctl+0x342/0x750
    [92867.459913] __x64_sys_ioctl+0x62/0xb0
    [92867.460377] do_syscall_64+0x33/0x80
    [92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    (...)
    < stack traces of more tasks blocked on metadata reservation like the clone
    task above, because the async reclaim task has deadlocked >
    (...)

    Another thing to notice is that the worker task that is deadlocked when
    trying to flush the destination inode of the clone operation is at
    btrfs_invalidatepage(). This is simply because the clone operation has a
    destination offset greater than the i_size and we only update the i_size
    of the destination file after cloning an extent (just like we do in the
    buffered write path).

    Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
    the flushing of delalloc for all inodes that have delalloc, add a runtime
    flag to an inode to signal it should not be flushed, and for inodes with
    that flag set, start_delalloc_inodes() will simply skip them. When the
    cloning code needs to dirty a page to copy an inline extent, set that flag
    on the inode and then clear it when the clone operation finishes.
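
    The mechanism can be pictured with a toy model. Everything below
    (TOY_NO_DELALLOC_FLUSH, struct toy_inode, start_delalloc_toy) is a
    hypothetical stand-in, not the btrfs data structures.

    ```c
    #include <stddef.h>

    #define TOY_NO_DELALLOC_FLUSH (1UL << 0)   /* hypothetical runtime flag bit */

    struct toy_inode {
        unsigned long runtime_flags;
        int flushed;                /* set to 1 when the inode gets flushed */
    };

    /*
     * Flush every inode except those carrying the "do not flush" flag.
     * Returns the number of inodes flushed by this call.
     */
    size_t start_delalloc_toy(struct toy_inode *inodes, size_t n)
    {
        size_t count = 0;

        for (size_t i = 0; i < n; i++) {
            if (inodes[i].runtime_flags & TOY_NO_DELALLOC_FLUSH)
                continue;           /* owner is mid-clone: skip to avoid waiting on it */
            inodes[i].flushed = 1;
            count++;
        }
        return count;
    }
    ```

    The clone path would set the flag before dirtying the page and clear it
    once the operation finishes, after which the inode is flushed as usual.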

    This could be sporadically triggered with test case generic/269 from
    fstests, which exercises many fsstress processes running in parallel with
    several dd processes filling up the entire filesystem.

    CC: stable@vger.kernel.org # 5.9+
    Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana
     
  • [ Upstream commit f2f121ab500d0457cc9c6f54269d21ffdf5bd304 ]

    Every time we log an inode we look up its xattrs in the fs/subvol tree
    and, if we have any, log them into the log tree. However it is very
    common to have inodes without any xattrs, so doing the search wastes
    time, but more importantly it adds contention on the fs/subvol tree
    locks, either making the logging code block and wait for tree locks or
    making other concurrent operations block and wait for the logging code.

    The most typical use cases where xattrs are used are when capabilities or
    ACLs are defined for an inode, or when SELinux is enabled.

    This change makes the logging code detect when an inode does not have
    xattrs and skip the xattrs search the next time the inode is logged,
    unless the inode is evicted and loaded again or an xattr is added to the
    inode. This avoids the search for xattrs on inodes that never have
    xattrs and are fsynced with some frequency.
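
    A minimal sketch of the caching idea, with made-up names (struct
    logged_inode, log_inode_xattrs, add_xattr) and a counter standing in for
    the fs/subvol tree search. It is not the btrfs logging code.

    ```c
    #include <stdbool.h>

    struct logged_inode {
        int  xattr_count;   /* xattrs currently stored for the inode */
        bool no_xattrs;     /* cached result: the last search found none */
        int  searches;      /* number of (simulated) tree searches performed */
    };

    /* Log the inode's xattrs, skipping the search if none were found last time. */
    void log_inode_xattrs(struct logged_inode *inode)
    {
        if (inode->no_xattrs)
            return;                     /* cached: nothing to look up */
        inode->searches++;              /* stands in for the tree search */
        if (inode->xattr_count == 0)
            inode->no_xattrs = true;    /* remember for the next logging */
    }

    /* Adding an xattr invalidates the cached "no xattrs" result. */
    void add_xattr(struct logged_inode *inode)
    {
        inode->xattr_count++;
        inode->no_xattrs = false;
    }
    ```

    Logging an xattr-less inode three times performs one search in this
    model. After add_xattr() every logging searches again.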

    The following script that calls dbench was used to measure the impact of
    this change on a VM with 8 CPUs, 16GB of RAM, using a raw NVMe device
    directly (no intermediary filesystem on the host) and using a non-debug
    kernel (default configuration on Debian distributions):

    $ cat test.sh
    #!/bin/bash

    DEV=/dev/sdk
    MNT=/mnt/sdk
    MOUNT_OPTIONS="-o ssd"

    mkfs.btrfs -f -m single -d single $DEV
    mount $MOUNT_OPTIONS $DEV $MNT

    dbench -D $MNT -t 200 40

    umount $MNT

    The results before this change:

    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    NTCreateX    5761605     0.172   312.057
    Close        4232452     0.002    10.927
    Rename        243937     1.406   277.344
    Unlink       1163456     0.631   298.402
    Deltree          160    11.581   221.107
    Mkdir             80     0.003     0.005
    Qpathinfo    5221410     0.065   122.309
    Qfileinfo     915432     0.001     3.333
    Qfsinfo       957555     0.003     3.992
    Sfileinfo     469244     0.023    20.494
    Find         2018865     0.448   123.659
    WriteX       2874851     0.049   118.529
    ReadX        9030579     0.004    21.654
    LockX          18754     0.003     4.423
    UnlockX        18754     0.002     0.331
    Flush         403792    10.944   359.494

    Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms

    The results after this change:

    Operation      Count    AvgLat    MaxLat
    ----------------------------------------
    NTCreateX    6442521     0.159   230.693
    Close        4732357     0.002    10.972
    Rename        272809     1.293   227.398
    Unlink       1301059     0.563   218.500
    Deltree          160     7.796    54.887
    Mkdir             80     0.008     0.478
    Qpathinfo    5839452     0.047   124.330
    Qfileinfo    1023199     0.001     4.996
    Qfsinfo      1070760     0.003     5.709
    Sfileinfo     524790     0.033    21.765
    Find         2257658     0.314   125.611
    WriteX       3211520     0.040   232.135
    ReadX       10098969     0.004    25.340
    LockX          20974     0.003     1.569
    UnlockX        20974     0.002     3.475
    Flush         451553    10.287   331.037

    Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms

    +10.8% throughput, -8.2% max latency
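
    The two percentages can be recomputed from the Throughput lines above.
    Which formula was used is not stated; as an assumption, the sketch below
    shows that taking the difference relative to the mean of the two values
    yields figures that round to +10.8% and -8.2%, whereas taking it relative
    to the "before" value gives roughly +11.4% and -7.9%. The function names
    are made up for this example.

    ```c
    /* Change from before to after, in percent of the mean of the two values. */
    double pct_diff_vs_mean(double before, double after)
    {
        return (after - before) / ((before + after) / 2.0) * 100.0;
    }

    /* Change from before to after, in percent of the "before" value. */
    double pct_diff_vs_before(double before, double after)
    {
        return (after - before) / before * 100.0;
    }
    ```

    Here the throughput inputs are 908.444 and 1011.77 MB/sec and the max
    latency inputs are 359.500 and 331.045 ms.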

    Reviewed-by: Josef Bacik
    Signed-off-by: Filipe Manana
    Reviewed-by: David Sterba
    Signed-off-by: David Sterba
    Signed-off-by: Sasha Levin

    Filipe Manana