Eric Lee / smarc-fsl-linux-kernel

27 Jan, 2021

3 commits

168a59bab UPSTREAM: kernfs: wire up ->splice_read and ->splice_write ... Browse Code »

Wire up the splice_read and splice_write methods to the default
helpers using ->read_iter and ->write_iter now that those are
implemented for kernfs. This restores support to use splice and
sendfile on kernfs files.

Fixes: 36e2c7421f02 ("fs: don't allow splice read/write without explicit ops")
Reported-by: Siddharth Gupta
Tested-by: Siddharth Gupta
Signed-off-by: Christoph Hellwig
Link: https://lore.kernel.org/r/20210120204631.274206-4-hch@lst.de
Signed-off-by: Greg Kroah-Hartman
(cherry picked from commit f2d6c2708bd84ca953fa6b6ca5717e79eb0140c7)
Bug: 176079972
Signed-off-by: Greg Kroah-Hartman
Change-Id: I80f09db0b8569fa63db59ada9c3d45d6600b2d70

Christoph Hellwig
2021-01-27 15:04:54 +0800
69427f456 UPSTREAM: kernfs: implement ->write_iter ... Browse Code »

Switch kernfs to implement the write_iter method instead of plain old
write to prepare to supporting splice and sendfile again.

Signed-off-by: Christoph Hellwig
Link: https://lore.kernel.org/r/20210120204631.274206-3-hch@lst.de
Signed-off-by: Greg Kroah-Hartman
(cherry picked from commit cc099e0b399889c6485c88368b19824b087c9f8c)
Signed-off-by: Greg Kroah-Hartman
Change-Id: I660c93c520169534c874cef2d246fef8c8fe2bbc

Christoph Hellwig
2021-01-27 15:04:54 +0800
b7fdb6afb UPSTREAM: kernfs: implement ->read_iter ... Browse Code »

Switch kernfs to implement the read_iter method instead of plain old
read to prepare to supporting splice and sendfile again.

Signed-off-by: Christoph Hellwig
Link: https://lore.kernel.org/r/20210120204631.274206-2-hch@lst.de
Signed-off-by: Greg Kroah-Hartman
(cherry picked from commit 4eaad21a6ac9865df7f31983232ed5928450458d)
Signed-off-by: Greg Kroah-Hartman
Change-Id: I75af3b3bae22de7ba46ec38b8182be335d6760e8

Christoph Hellwig
2021-01-27 15:04:54 +0800

20 Jan, 2021

31 commits

88e2d5fd1 Merge 5.10.9 into android12-5.10 ... Browse Code »

Changes in 5.10.9
btrfs: reloc: fix wrong file extent type check to avoid false ENOENT
btrfs: prevent NULL pointer dereference in extent_io_tree_panic
ALSA: hda/realtek: fix right sounds and mute/micmute LEDs for HP machines
ALSA: doc: Fix reference to mixart.rst
ASoC: AMD Renoir - add DMI entry for Lenovo ThinkPad X395
ASoC: dapm: remove widget from dirty list on free
x86/hyperv: check cpu mask after interrupt has been disabled
drm/amdgpu: add green_sardine device id (v2)
drm/amdgpu: fix DRM_INFO flood if display core is not supported (bug 210921)
Revert "drm/amd/display: Fixed Intermittent blue screen on OLED panel"
drm/amdgpu: add new device id for Renior
drm/i915: Allow the sysadmin to override security mitigations
drm/i915/gt: Limit VFE threads based on GT
drm/i915/backlight: fix CPU mode backlight takeover on LPT
drm/bridge: sii902x: Refactor init code into separate function
dt-bindings: display: sii902x: Add supply bindings
drm/bridge: sii902x: Enable I/O and core VCC supplies if present
tracing/kprobes: Do the notrace functions check without kprobes on ftrace
tools/bootconfig: Add tracing_on support to helper scripts
ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR
ext4: fix wrong list_splice in ext4_fc_cleanup
ext4: fix bug for rename with RENAME_WHITEOUT
cifs: check pointer before freeing
cifs: fix interrupted close commands
riscv: Drop a duplicated PAGE_KERNEL_EXEC
riscv: return -ENOSYS for syscall -1
riscv: Fixup CONFIG_GENERIC_TIME_VSYSCALL
riscv: Fix KASAN memory mapping.
mips: fix Section mismatch in reference
mips: lib: uncached: fix non-standard usage of variable 'sp'
MIPS: boot: Fix unaligned access with CONFIG_MIPS_RAW_APPENDED_DTB
MIPS: Fix malformed NT_FILE and NT_SIGINFO in 32bit coredumps
MIPS: relocatable: fix possible boot hangup with KASLR enabled
RDMA/ocrdma: Fix use after free in ocrdma_dealloc_ucontext_pd()
ACPI: scan: Harden acpi_device_add() against device ID overflows
xen/privcmd: allow fetching resource sizes
compiler.h: Raise minimum version of GCC to 5.1 for arm64
mm/vmalloc.c: fix potential memory leak
mm/hugetlb: fix potential missing huge page size info
mm/process_vm_access.c: include compat.h
dm raid: fix discard limits for raid1
dm snapshot: flush merged data before committing metadata
dm integrity: fix flush with external metadata device
dm integrity: fix the maximum number of arguments
dm crypt: use GFP_ATOMIC when allocating crypto requests from softirq
dm crypt: do not wait for backlogged crypto request completion in softirq
dm crypt: do not call bio_endio() from the dm-crypt tasklet
dm crypt: defer decryption to a tasklet if interrupts disabled
stmmac: intel: change all EHL/TGL to auto detect phy addr
r8152: Add Lenovo Powered USB-C Travel Hub
btrfs: tree-checker: check if chunk item end overflows
ext4: don't leak old mountpoint samples
io_uring: don't take files/mm for a dead task
io_uring: drop mm and files after task_work_run
ARC: build: remove non-existing bootpImage from KBUILD_IMAGE
ARC: build: add uImage.lzma to the top-level target
ARC: build: add boot_targets to PHONY
ARC: build: move symlink creation to arch/arc/Makefile to avoid race
ARM: omap2: pmic-cpcap: fix maximum voltage to be consistent with defaults on xt875
ath11k: fix crash caused by NULL rx_channel
netfilter: ipset: fixes possible oops in mtype_resize
ath11k: qmi: try to allocate a big block of DMA memory first
btrfs: fix async discard stall
btrfs: merge critical sections of discard lock in workfn
btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan
regulator: bd718x7: Add enable times
ethernet: ucc_geth: fix definition and size of ucc_geth_tx_global_pram
ARM: dts: ux500/golden: Set display max brightness
habanalabs: adjust pci controller init to new firmware
habanalabs/gaudi: retry loading TPC f/w on -EINTR
habanalabs: register to pci shutdown callback
staging: spmi: hisi-spmi-controller: Fix some error handling paths
spi: altera: fix return value for altera_spi_txrx()
habanalabs: Fix memleak in hl_device_reset
hwmon: (pwm-fan) Ensure that calculation doesn't discard big period values
lib/raid6: Let $(UNROLL) rules work with macOS userland
kconfig: remove 'kvmconfig' and 'xenconfig' shorthands
spi: fix the divide by 0 error when calculating xfer waiting time
io_uring: drop file refs after task cancel
bfq: Fix computation of shallow depth
arch/arc: add copy_user_page() to to fix build error on ARC
misdn: dsp: select CONFIG_BITREVERSE
net: ethernet: fs_enet: Add missing MODULE_LICENSE
selftests: fix the return value for UDP GRO test
nvme-pci: mark Samsung PM1725a as IGNORE_DEV_SUBNQN
nvme: avoid possible double fetch in handling CQE
nvmet-rdma: Fix list_del corruption on queue establishment failure
drm/amd/display: fix sysfs amdgpu_current_backlight_pwm NULL pointer issue
drm/amdgpu: fix a GPU hang issue when remove device
drm/amd/pm: fix the failure when change power profile for renoir
drm/amdgpu: fix potential memory leak during navi12 deinitialization
usb: typec: Fix copy paste error for NVIDIA alt-mode description
iommu/vt-d: Fix lockdep splat in sva bind()/unbind()
ACPI: scan: add stub acpi_create_platform_device() for !CONFIG_ACPI
drm/msm: Call msm_init_vram before binding the gpu
ARM: picoxcell: fix missing interrupt-parent properties
poll: fix performance regression due to out-of-line __put_user()
rcu-tasks: Move RCU-tasks initialization to before early_initcall()
bpf: Simplify task_file_seq_get_next()
bpf: Save correct stopping point in file seq iteration
x86/sev-es: Fix SEV-ES OUT/IN immediate opcode vc handling
cfg80211: select CONFIG_CRC32
nvme-fc: avoid calling _nvme_fc_abort_outstanding_ios from interrupt context
iommu/vt-d: Update domain geometry in iommu_ops.at(de)tach_dev
net/mlx5e: CT: Use per flow counter when CT flow accounting is enabled
net/mlx5: Fix passing zero to 'PTR_ERR'
net/mlx5: E-Switch, fix changing vf VLANID
blk-mq-debugfs: Add decode for BLK_MQ_F_TAG_HCTX_SHARED
mm: fix clear_refs_write locking
mm: don't play games with pinned pages in clear_page_refs
mm: don't put pinned pages into the swap cache
perf intel-pt: Fix 'CPU too large' error
dump_common_audit_data(): fix racy accesses to ->d_name
ASoC: meson: axg-tdm-interface: fix loopback
ASoC: meson: axg-tdmin: fix axg skew offset
ASoC: Intel: fix error code cnl_set_dsp_D0()
nvmet-rdma: Fix NULL deref when setting pi_enable and traddr INADDR_ANY
nvme: don't intialize hwmon for discovery controllers
nvme-tcp: fix possible data corruption with bio merges
nvme-tcp: Fix warning with CONFIG_DEBUG_PREEMPT
NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock
pNFS: We want return-on-close to complete when evicting the inode
pNFS: Mark layout for return if return-on-close was not sent
pNFS: Stricter ordering of layoutget and layoutreturn
NFS: Adjust fs_context error logging
NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request
NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit()
NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter
NFS: nfs_delegation_find_inode_server must first reference the superblock
NFS: nfs_igrab_and_active must first reference the superblock
scsi: ufs: Fix possible power drain during system suspend
ext4: fix superblock checksum failure when setting password salt
RDMA/restrack: Don't treat as an error allocation ID wrapping
RDMA/usnic: Fix memleak in find_free_vf_and_create_qp_grp
bnxt_en: Improve stats context resource accounting with RDMA driver loaded.
RDMA/mlx5: Fix wrong free of blue flame register on error
IB/mlx5: Fix error unwinding when set_has_smi_cap fails
umount(2): move the flag validity checks first
dm zoned: select CONFIG_CRC32
drm/i915/dsi: Use unconditional msleep for the panel_on_delay when there is no reset-deassert MIPI-sequence
drm/i915/icl: Fix initing the DSI DSC power refcount during HW readout
drm/i915/gt: Restore clear-residual mitigations for Ivybridge, Baytrail
mm, slub: consider rest of partial list if acquire_slab() fails
riscv: Trace irq on only interrupt is enabled
iommu/vt-d: Fix unaligned addresses for intel_flush_svm_range_dev()
net: sunrpc: interpret the return value of kstrtou32 correctly
selftests: netfilter: Pass family parameter "-f" to conntrack tool
dm: eliminate potential source of excessive kernel log noise
ALSA: fireface: Fix integer overflow in transmit_midi_msg()
ALSA: firewire-tascam: Fix integer overflow in midi_port_work()
netfilter: conntrack: fix reading nf_conntrack_buckets
netfilter: nf_nat: Fix memleak in nf_nat_init
Linux 5.10.9

Signed-off-by: Greg Kroah-Hartman
Change-Id: I609e501511889081e03d2d18ee7e1be95406f396

Greg Kroah-Hartman
2021-01-20 01:49:54 +0800
c6dc4f8e6 umount(2): move the flag validity checks first ... Browse Code »

commit a0a6df9afcaf439a6b4c88a3b522e3d05fdef46f upstream.

Unfortunately, there's userland code that used to rely upon these
checks being done before anything else to check for UMOUNT_NOFOLLOW
support. That broke in 41525f56e256 ("fs: refactor ksys_umount").
Separate those from the rest of checks and move them to ksys_umount();
unlike everything else in there, this can be sanely done there.

Reported-by: Sargun Dhillon
Fixes: 41525f56e256 ("fs: refactor ksys_umount")
Signed-off-by: Al Viro
Signed-off-by: Greg Kroah-Hartman

Al Viro
2021-01-20 01:27:32 +0800
cd223237e ext4: fix superblock checksum failure when setting password salt ... Browse Code »

commit dfd56c2c0c0dbb11be939b804ddc8d5395ab3432 upstream.

When setting password salt in the superblock, we forget to recompute the
superblock checksum so it will not match until the next superblock
modification which recomputes the checksum. Fix it.

CC: Michael Halcrow
Reported-by: Andreas Dilger
Fixes: 9bd8212f981e ("ext4 crypto: add encryption policy and password salt support")
Signed-off-by: Jan Kara
Link: https://lore.kernel.org/r/20201216101844.22917-8-jack@suse.cz
Signed-off-by: Theodore Ts'o
Signed-off-by: Greg Kroah-Hartman

Jan Kara
2021-01-20 01:27:31 +0800
51121ea1d NFS: nfs_igrab_and_active must first reference the superblock ... Browse Code »

commit 896567ee7f17a8a736cda8a28cc987228410a2ac upstream.

Before referencing the inode, we must ensure that the superblock can be
referenced. Otherwise, we can end up with iput() calling superblock
operations that are no longer valid or accessible.

Fixes: ea7c38fef0b7 ("NFSv4: Ensure we reference the inode for return-on-close in delegreturn")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2021-01-20 01:27:31 +0800
b4689562f NFS: nfs_delegation_find_inode_server must first reference the superblock ... Browse Code »

commit 113aac6d567bda783af36d08f73bfda47d8e9a40 upstream.

Before referencing the inode, we must ensure that the superblock can be
referenced. Otherwise, we can end up with iput() calling superblock
operations that are no longer valid or accessible.

Fixes: e39d8a186ed0 ("NFSv4: Fix an Oops during delegation callbacks")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2021-01-20 01:27:31 +0800
01a12a24f NFS/pNFS: Fix a leak of the layout 'plh_outstanding' counter ... Browse Code »

commit cb2856c5971723910a86b7d1d0cf623d6919cbc4 upstream.

If we exit _lgopen_prepare_attached() without setting a layout, we will
currently leak the plh_outstanding counter.

Fixes: 411ae722d10a ("pNFS: Wait for stale layoutget calls to complete in pnfs_update_layout()")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2021-01-20 01:27:31 +0800
b666f394d NFS/pNFS: Don't leak DS commits in pnfs_generic_retry_commit() ... Browse Code »

commit 46c9ea1d4fee4cf1f8cc6001b9c14aae61b3d502 upstream.

We must ensure that we pass a layout segment to nfs_retry_commit() when
we're cleaning up after pnfs_bucket_alloc_ds_commits(). Otherwise,
requests that should be committed to the DS will get committed to the
MDS.
Do so by ensuring that pnfs_bucket_get_committing() always tries to
return a layout segment when it returns a non-empty page list.

Fixes: c84bea59449a ("NFS/pNFS: Simplify bucket layout segment reference counting")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2021-01-20 01:27:31 +0800
067aefcdf NFS/pNFS: Don't call pnfs_free_bucket_lseg() before removing the request ... Browse Code »

commit 1757655d780d9d29bc4b60e708342e94924f7ef3 upstream.

In pnfs_generic_clear_request_commit(), we try calling
pnfs_free_bucket_lseg() before we remove the request from the DS bucket.
That will always fail, since the point is to test for whether or not
that bucket is empty.

Fixes: c84bea59449a ("NFS/pNFS: Simplify bucket layout segment reference counting")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2021-01-20 01:27:31 +0800
e6ae16467 NFS: Adjust fs_context error logging ... Browse Code »

commit c98e9daa59a611ff4e163689815f40380c912415 upstream.

Several existing dprink()/dfprintk() calls were converted to use the new
mount API logging macros by commit ce8866f0913f ("NFS: Attach
supplementary error information to fs_context"). If the fs_context was
not created using fsopen() then it will not have had a log buffer
allocated for it, and the new mount API logging macros will wind up
calling printk().

This can result in syslog messages being logged where previously there
were none... most notably "NFS4: Couldn't follow remote path", which can
happen if the client is auto-negotiating a protocol version with an NFS
server that doesn't support the higher v4.x versions.

Convert the nfs_errorf(), nfs_invalf(), and nfs_warnf() macros to check
for the existence of the fs_context's log buffer and call dprintk() if
it doesn't exist. Add nfs_ferrorf(), nfs_finvalf(), and nfs_warnf(),
which do the same thing but take an NFS debug flag as an argument and
call dfprintk(). Finally, modify the "NFS4: Couldn't follow remote
path" message to use nfs_ferrorf().

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=207385
Signed-off-by: Scott Mayhew
Reviewed-by: Benjamin Coddington
Fixes: ce8866f0913f ("NFS: Attach supplementary error information to fs_context.")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Scott Mayhew
2021-01-20 01:27:30 +0800
06f58dbc4 pNFS: Stricter ordering of layoutget and layoutreturn ... Browse Code »

commit 2c8d5fc37fe2384a9bdb6965443ab9224d46f704 upstream.

If a layout return is in progress, we should wait for it to complete,
in case the layout segment we are picking up gets returned too.

Fixes: 30cb3ee299cb ("pNFS: Handle NFS4ERR_OLD_STATEID on layoutreturn by bumping the state seqid")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2021-01-20 01:27:30 +0800
ecaaad180 pNFS: Mark layout for return if return-on-close was not sent ... Browse Code »

commit 67bbceedc9bb8ad48993a8bd6486054756d711f4 upstream.

If the layout return-on-close failed because the layoutreturn was never
sent, then we should mark the layout for return again.

Fixes: 9c47b18cf722 ("pNFS: Ensure we do clear the return-on-close layout stateid on fatal errors")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2021-01-20 01:27:30 +0800
f128de17c pNFS: We want return-on-close to complete when evicting the inode ... Browse Code »

commit 078000d02d57f02dde61de4901f289672e98c8bc upstream.

If the inode is being evicted, it should be safe to run return-on-close,
so we should do it to ensure we don't inadvertently leak layout segments.

Fixes: 1c5bd76d17cc ("pNFS: Enable layoutreturn operation for return-on-close")
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Trond Myklebust
2021-01-20 01:27:30 +0800
1b42712e4 NFS4: Fix use-after-free in trace_event_raw_event_nfs4_set_lock ... Browse Code »

commit 3d1a90ab0ed93362ec8ac85cf291243c87260c21 upstream.

It is only safe to call the tracepoint before rpc_put_task() because
'data' is freed inside nfs4_lock_release (rpc_release).

Fixes: 48c9579a1afe ("Adding stateid information to tracepoints")
Signed-off-by: Dave Wysochanski
Signed-off-by: Trond Myklebust
Signed-off-by: Greg Kroah-Hartman

Dave Wysochanski
2021-01-20 01:27:30 +0800
1eea10899 mm: don't play games with pinned pages in clear_page_refs ... Browse Code »

[ Upstream commit 9348b73c2e1bfea74ccd4a44fb4ccc7276ab9623 ]

Turning a pinned page read-only breaks the pinning after COW. Don't do it.

The whole "track page soft dirty" state doesn't work with pinned pages
anyway, since the page might be dirtied by the pinning entity without
ever being noticed in the page tables.

Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin

Linus Torvalds
2021-01-20 01:27:29 +0800
41b0b0c09 mm: fix clear_refs_write locking ... Browse Code »

[ Upstream commit 29a951dfb3c3263c3a0f3bd9f7f2c2cfde4baedb ]

Turning page table entries read-only requires the mmap_sem held for
writing.

So stop doing the odd games with turning things from read locks to write
locks and back. Just get the write lock.

Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin

Linus Torvalds
2021-01-20 01:27:29 +0800
bc880f204 poll: fix performance regression due to out-of-line __put_user() ... Browse Code »

[ Upstream commit ef0ba05538299f1391cbe097de36895bb36ecfe6 ]

The kernel test robot reported a -5.8% performance regression on the
"poll2" test of will-it-scale, and bisected it to commit d55564cfc222
("x86: Make __put_user() generate an out-of-line call").

I didn't expect an out-of-line __put_user() to matter, because no normal
core code should use that non-checking legacy version of user access any
more. But I had overlooked the very odd poll() usage, which does a
__put_user() to update the 'revents' values of the poll array.

Now, Al Viro correctly points out that instead of updating just the
'revents' field, it would be much simpler to just copy the _whole_
pollfd entry, and then we could just use "copy_to_user()" on the whole
array of entries, the same way we use "copy_from_user()" a few lines
earlier to get the original values.

But that is not what we've traditionally done, and I worry that threaded
applications might be concurrently modifying the other fields of the
pollfd array. So while Al's suggestion is simpler - and perhaps worth
trying in the future - this instead keeps the "just update revents"
model.

To fix the performance regression, use the modern "unsafe_put_user()"
instead of __put_user(), with the proper "user_write_access_begin()"
guarding in place. This improves code generation enormously.

Link: https://lore.kernel.org/lkml/20210107134723.GA28532@xsang-OptiPlex-9020/
Reported-by: kernel test robot
Tested-by: Oliver Sang
Cc: Al Viro
Cc: David Laight
Cc: Peter Zijlstra
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin

Linus Torvalds
2021-01-20 01:27:27 +0800
94dbb87fc io_uring: drop file refs after task cancel ... Browse Code »

[ Upstream commit de7f1d9e99d8b99e4e494ad8fcd91f0c4c5c9357 ]

io_uring fds marked O_CLOEXEC and we explicitly cancel all requests
before going through exec, so we don't want to leave task's file
references to not our anymore io_uring instances.

Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin

Pavel Begunkov
2021-01-20 01:27:25 +0800
29543864c btrfs: fix transaction leak and crash after RO remount caused by qgroup rescan ... Browse Code »

[ Upstream commit cb13eea3b49055bd78e6ddf39defd6340f7379fc ]

If we remount a filesystem in RO mode while the qgroup rescan worker is
running, we can end up having it still running after the remount is done,
and at unmount time we may end up with an open transaction that ends up
never getting committed. If that happens we end up with several memory
leaks and can crash when hardware acceleration is unavailable for crc32c.
Possibly it can lead to other nasty surprises too, due to use-after-free
issues.

The following steps explain how the problem happens.

1) We have a filesystem mounted in RW mode and the qgroup rescan worker is
running;

2) We remount the filesystem in RO mode, and never stop/pause the rescan
worker, so after the remount the rescan worker is still running. The
important detail here is that the rescan task is still running after
the remount operation committed any ongoing transaction through its
call to btrfs_commit_super();

3) The rescan is still running, and after the remount completed, the
rescan worker started a transaction, after it finished iterating all
leaves of the extent tree, to update the qgroup status item in the
quotas tree. It does not commit the transaction, it only releases its
handle on the transaction;

4) A filesystem unmount operation starts shortly after;

5) The unmount task, at close_ctree(), stops the transaction kthread,
which had not had a chance to commit the open transaction since it was
sleeping and the commit interval (default of 30 seconds) has not yet
elapsed since the last time it committed a transaction;

6) So after stopping the transaction kthread we still have the transaction
used to update the qgroup status item open. At close_ctree(), when the
filesystem is in RO mode and no transaction abort happened (or the
filesystem is in error mode), we do not expect to have any transaction
open, so we do not call btrfs_commit_super();

7) We then proceed to destroy the work queues, free the roots and block
groups, etc. After that we drop the last reference on the btree inode
by calling iput() on it. Since there are dirty pages for the btree
inode, corresponding to the COWed extent buffer for the quotas btree,
btree_write_cache_pages() is invoked to flush those dirty pages. This
results in creating a bio and submitting it, which makes us end up at
btrfs_submit_metadata_bio();

8) At btrfs_submit_metadata_bio() we end up at the if-then-else branch
that calls btrfs_wq_submit_bio(), because check_async_write() returned
a value of 1. This value of 1 is because we did not have hardware
acceleration available for crc32c, so BTRFS_FS_CSUM_IMPL_FAST was not
set in fs_info->flags;

9) Then at btrfs_wq_submit_bio() we call btrfs_queue_work() against the
workqueue at fs_info->workers, which was already freed before by the
call to btrfs_stop_all_workers() at close_ctree(). This results in an
invalid memory access due to a use-after-free, leading to a crash.

When this happens, before the crash there are several warnings triggered,
since we have reserved metadata space in a block group, the delayed refs
reservation, etc:

------------[ cut here ]------------
WARNING: CPU: 4 PID: 1729896 at fs/btrfs/block-group.c:125 btrfs_put_block_group+0x63/0xa0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 4 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_put_block_group+0x63/0xa0 [btrfs]
Code: f0 01 00 00 48 39 c2 75 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 0000000000000001 RBX: ffff947ed73e4000 RCX: ffff947ebc8b29c8
RDX: 0000000000000001 RSI: ffffffffc0b150a0 RDI: ffff947ebc8b2800
RBP: ffff947ebc8b2800 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e4160 R14: ffff947ebc8b2988 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f37e2893320 CR3: 0000000138f68001 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x17f/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 48 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [] 0x0
hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [] 0x0
---[ end trace dd74718fef1ed5c6 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-rsv.c:459 btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 2 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_release_global_block_rsv+0x70/0xc0 [btrfs]
Code: 48 83 bb b0 03 00 00 00 (...)
RSP: 0018:ffffb270826bbdd8 EFLAGS: 00010206
RAX: 000000000033c000 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffffffffc0b0d8c1 RDI: 00000000ffffffff
RBP: ffff947ebc8b7000 R08: 0000000000000001 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ed73e4110
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481aca00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000561a79f76e20 CR3: 0000000138f68006 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_free_block_groups+0x24c/0x2f0 [btrfs]
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [] 0x0
hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [] 0x0
---[ end trace dd74718fef1ed5c7 ]---
------------[ cut here ]------------
WARNING: CPU: 2 PID: 1729896 at fs/btrfs/block-group.c:3377 btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
CPU: 5 PID: 1729896 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_free_block_groups+0x25d/0x2f0 [btrfs]
Code: ad de 49 be 22 01 00 (...)
RSP: 0018:ffffb270826bbde8 EFLAGS: 00010206
RAX: ffff947ebeae1d08 RBX: ffff947ed73e4000 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff947e9d823ae8 RDI: 0000000000000246
RBP: ffff947ebeae1d08 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff947ebeae1c00
R13: ffff947ed73e5278 R14: dead000000000122 R15: dead000000000100
FS: 00007f15edfea840(0000) GS:ffff9481ad200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1475d98ea8 CR3: 0000000138f68005 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
close_ctree+0x2ba/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f15ee221ee7
Code: ff 0b 00 f7 d8 64 89 (...)
RSP: 002b:00007ffe9470f0f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f15ee347264 RCX: 00007f15ee221ee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 000056169701d000
RBP: 0000561697018a30 R08: 0000000000000000 R09: 00007f15ee2e2be0
R10: 000056169701efe0 R11: 0000000000000246 R12: 0000000000000000
R13: 000056169701d000 R14: 0000561697018b40 R15: 0000561697018c60
irq event stamp: 0
hardirqs last enabled at (0): [] 0x0
hardirqs last disabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last enabled at (0): [] copy_process+0x8a0/0x1d70
softirqs last disabled at (0): [] 0x0
---[ end trace dd74718fef1ed5c8 ]---
BTRFS info (device sdc): space_info 4 has 268238848 free, is not full
BTRFS info (device sdc): space_info total=268435456, used=114688, pinned=0, reserved=16384, may_use=0, readonly=65536
BTRFS info (device sdc): global_block_rsv: size 0 reserved 0
BTRFS info (device sdc): trans_block_rsv: size 0 reserved 0
BTRFS info (device sdc): chunk_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_block_rsv: size 0 reserved 0
BTRFS info (device sdc): delayed_refs_rsv: size 524288 reserved 0

And the crash, which only happens when we do not have crc32c hardware
acceleration, produces the following trace immediately after those
warnings:

stack segment: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
CPU: 2 PID: 1749129 Comm: umount Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:btrfs_queue_work+0x36/0x190 [btrfs]
Code: 54 55 53 48 89 f3 (...)
RSP: 0018:ffffb27082443ae8 EFLAGS: 00010282
RAX: 0000000000000004 RBX: ffff94810ee9ad90 RCX: 0000000000000000
RDX: 0000000000000001 RSI: ffff94810ee9ad90 RDI: ffff947ed8ee75a0
RBP: a56b6b6b6b6b6b6b R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000007 R11: 0000000000000001 R12: ffff947fa9b435a8
R13: ffff94810ee9ad90 R14: 0000000000000000 R15: ffff947e93dc0000
FS: 00007f3cfe974840(0000) GS:ffff9481ac600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1b42995a70 CR3: 0000000127638003 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
btrfs_wq_submit_bio+0xb3/0xd0 [btrfs]
btrfs_submit_metadata_bio+0x44/0xc0 [btrfs]
submit_one_bio+0x61/0x70 [btrfs]
btree_write_cache_pages+0x414/0x450 [btrfs]
? kobject_put+0x9a/0x1d0
? trace_hardirqs_on+0x1b/0xf0
? _raw_spin_unlock_irqrestore+0x3c/0x60
? free_debug_processing+0x1e1/0x2b0
do_writepages+0x43/0xe0
? lock_acquired+0x199/0x490
__writeback_single_inode+0x59/0x650
writeback_single_inode+0xaf/0x120
write_inode_now+0x94/0xd0
iput+0x187/0x2b0
close_ctree+0x2c6/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f3cfebabee7
Code: ff 0b 00 f7 d8 64 89 01 (...)
RSP: 002b:00007ffc9c9a05f8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a6
RAX: 0000000000000000 RBX: 00007f3cfecd1264 RCX: 00007f3cfebabee7
RDX: ffffffffffffff78 RSI: 0000000000000000 RDI: 0000562b6b478000
RBP: 0000562b6b473a30 R08: 0000000000000000 R09: 00007f3cfec6cbe0
R10: 0000562b6b479fe0 R11: 0000000000000246 R12: 0000000000000000
R13: 0000562b6b478000 R14: 0000562b6b473b40 R15: 0000562b6b473c60
Modules linked in: btrfs dm_snapshot dm_thin_pool (...)
---[ end trace dd74718fef1ed5cc ]---

Finally when we remove the btrfs module (rmmod btrfs), there are several
warnings about objects that were allocated from our slabs but were never
freed, consequence of the transaction that was never committed and got
leaked:

=============================================================================
BUG btrfs_delayed_ref_head (Tainted: G B W ): Objects remaining in btrfs_delayed_ref_head on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

INFO: Slab 0x0000000094c2ae56 objects=24 used=2 fp=0x000000002bfa2521 flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x0000000050cbdd61 @offset=12104
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1894 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=4292 cpu=2 pid=1729526
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
sync_filesystem+0x74/0x90
generic_shutdown_super+0x22/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x0000000086e9b0ff @offset=12776
INFO: Allocated in btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs] age=1900 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0xbb/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x1117/0x1290 [btrfs] age=3141 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x1117/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_write_dirty_block_groups+0x17d/0x3d0 [btrfs]
commit_cowonly_roots+0x248/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_ref_head: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x11/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 0b (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_tree_ref (Tainted: G B W ): Objects remaining in btrfs_delayed_tree_ref on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

INFO: Slab 0x0000000011f78dc0 objects=37 used=2 fp=0x0000000032d55d91 flags=0x17fffc000010200
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? lock_release+0x20e/0x4c0
kmem_cache_destroy+0x55/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000001a340018 @offset=4408
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1917 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_free_tree_block+0x128/0x360 [btrfs]
__btrfs_cow_block+0x489/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=4167 cpu=4 pid=1729795
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
btrfs_commit_transaction+0x60/0xc40 [btrfs]
create_subvol+0x56a/0x990 [btrfs]
btrfs_mksubvol+0x3fb/0x4a0 [btrfs]
__btrfs_ioctl_snap_create+0x119/0x1a0 [btrfs]
btrfs_ioctl_snap_create+0x58/0x80 [btrfs]
btrfs_ioctl+0x1a92/0x36f0 [btrfs]
__x64_sys_ioctl+0x83/0xb0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
INFO: Object 0x000000002b46292a @offset=13648
INFO: Allocated in btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs] age=1923 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_add_delayed_tree_ref+0x9e/0x480 [btrfs]
btrfs_alloc_tree_block+0x2bf/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
INFO: Freed in __btrfs_run_delayed_refs+0x63d/0x1290 [btrfs] age=3164 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0x63d/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_tree_ref: Slab cache still has objects
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
btrfs_delayed_ref_exit+0x1d/0x35 [btrfs]
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
=============================================================================
BUG btrfs_delayed_extent_op (Tainted: G B W ): Objects remaining in btrfs_delayed_extent_op on __kmem_cache_shutdown()
-----------------------------------------------------------------------------

INFO: Slab 0x00000000f145ce2f objects=22 used=1 fp=0x00000000af0f92cf flags=0x17fffc000010200
CPU: 5 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
slab_err+0xb7/0xdc
? lock_acquired+0x199/0x490
__kmem_cache_shutdown+0x1ac/0x3c0
? __mutex_unlock_slowpath+0x45/0x2a0
kmem_cache_destroy+0x55/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 f5 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
INFO: Object 0x000000004cf95ea8 @offset=6264
INFO: Allocated in btrfs_alloc_tree_block+0x1e0/0x360 [btrfs] age=1931 cpu=6 pid=1729873
__slab_alloc.isra.0+0x109/0x1c0
kmem_cache_alloc+0x7bb/0x830
btrfs_alloc_tree_block+0x1e0/0x360 [btrfs]
alloc_tree_block_no_bg_flush+0x4f/0x60 [btrfs]
__btrfs_cow_block+0x12d/0x5f0 [btrfs]
btrfs_cow_block+0xf7/0x220 [btrfs]
btrfs_search_slot+0x62a/0xc40 [btrfs]
btrfs_del_orphan_item+0x65/0xd0 [btrfs]
btrfs_find_orphan_roots+0x1bf/0x200 [btrfs]
open_ctree+0x125a/0x18a0 [btrfs]
btrfs_mount_root.cold+0x13/0xed [btrfs]
legacy_get_tree+0x30/0x60
vfs_get_tree+0x28/0xe0
fc_mount+0xe/0x40
vfs_kern_mount.part.0+0x71/0x90
btrfs_mount+0x13b/0x3e0 [btrfs]
INFO: Freed in __btrfs_run_delayed_refs+0xabd/0x1290 [btrfs] age=3173 cpu=6 pid=1729803
kmem_cache_free+0x34c/0x3c0
__btrfs_run_delayed_refs+0xabd/0x1290 [btrfs]
btrfs_run_delayed_refs+0x81/0x210 [btrfs]
commit_cowonly_roots+0xfb/0x300 [btrfs]
btrfs_commit_transaction+0x367/0xc40 [btrfs]
close_ctree+0x113/0x2fa [btrfs]
generic_shutdown_super+0x6c/0x100
kill_anon_super+0x14/0x30
btrfs_kill_super+0x12/0x20 [btrfs]
deactivate_locked_super+0x31/0x70
cleanup_mnt+0x100/0x160
task_work_run+0x68/0xb0
exit_to_user_mode_prepare+0x1bb/0x1c0
syscall_exit_to_user_mode+0x4b/0x260
entry_SYSCALL_64_after_hwframe+0x44/0xa9
kmem_cache_destroy btrfs_delayed_extent_op: Slab cache still has objects
CPU: 3 PID: 1729921 Comm: rmmod Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0x8d/0xb5
kmem_cache_destroy+0x119/0x120
exit_btrfs_fs+0xa/0x59 [btrfs]
__x64_sys_delete_module+0x194/0x260
? fpregs_assert_state_consistent+0x1e/0x40
? exit_to_user_mode_prepare+0x55/0x1c0
? trace_hardirqs_on+0x1b/0xf0
do_syscall_64+0x33/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7f693e305897
Code: 73 01 c3 48 8b 0d f9 (...)
RSP: 002b:00007ffcf73eb508 EFLAGS: 00000206 ORIG_RAX: 00000000000000b0
RAX: ffffffffffffffda RBX: 0000559df504f760 RCX: 00007f693e305897
RDX: 000000000000000a RSI: 0000000000000800 RDI: 0000559df504f7c8
RBP: 00007ffcf73eb568 R08: 0000000000000000 R09: 0000000000000000
R10: 00007f693e378ac0 R11: 0000000000000206 R12: 00007ffcf73eb740
R13: 00007ffcf73ec5a6 R14: 0000559df504f2a0 R15: 0000559df504f760
BTRFS: state leak: start 30408704 end 30425087 state 1 in tree 1 refs 1

Fix this issue by having the remount path stop the qgroup rescan worker
when we are remounting RO and teach the rescan worker to stop when a
remount is in progress. If later a remount in RW mode happens, we are
already resuming the qgroup rescan worker through the call to
btrfs_qgroup_rescan_resume(), so we do not need to worry about that.

Tested-by: Fabian Vogt
Reviewed-by: Josef Bacik
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Filipe Manana
2021-01-20 01:27:24 +0800
f89d84b35 btrfs: merge critical sections of discard lock in workfn ... Browse Code »

[ Upstream commit 8fc058597a283e9a37720abb0e8d68e342b9387d ]

btrfs_discard_workfn() drops discard_ctl->lock just to take it again in
a moment in btrfs_discard_schedule_work(). Avoid that and also reuse
ktime.

Reviewed-by: Josef Bacik
Signed-off-by: Pavel Begunkov
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Pavel Begunkov
2021-01-20 01:27:24 +0800
33061bd10 btrfs: fix async discard stall ... Browse Code »

[ Upstream commit ea9ed87c73e87e044b2c58d658eb4ba5216bc488 ]

Might happen that bg->discard_eligible_time was changed without
rescheduling, so btrfs_discard_workfn() wakes up earlier than that new
time, peek_discard_list() returns NULL, and all work halts and goes to
sleep without further rescheduling even there are block groups to
discard.

It happens pretty often, but not so visible from the userspace because
after some time it usually will be kicked off anyway by someone else
calling btrfs_discard_reschedule_work().

Fix it by continue rescheduling if block group discard lists are not
empty.

Reviewed-by: Josef Bacik
Signed-off-by: Pavel Begunkov
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Pavel Begunkov
2021-01-20 01:27:24 +0800
f7f32822a io_uring: drop mm and files after task_work_run ... Browse Code »

[ Upstream commit d434ab6db524ab1efd0afad4ffa1ee65ca6ac097 ]

__io_req_task_submit() run by task_work can set mm and files, but
io_sq_thread() in some cases, and because __io_sq_thread_acquire_mm()
and __io_sq_thread_acquire_files() do a simple current->mm/files check
it may end up submitting IO with mm/files of another task.

We also need to drop it after in the end to drop potentially grabbed
references to them.

Cc: stable@vger.kernel.org # 5.9+
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin

Pavel Begunkov
2021-01-20 01:27:23 +0800
a3647cddf io_uring: don't take files/mm for a dead task ... Browse Code »

[ Upstream commit 621fadc22365f3cf307bcd9048e3372e9ee9cdcc ]

In rare cases a task may be exiting while io_ring_exit_work() trying to
cancel/wait its requests. It's ok for __io_sq_thread_acquire_mm()
because of SQPOLL check, but is not for __io_sq_thread_acquire_files().
Play safe and fail for both of them.

Cc: stable@vger.kernel.org # 5.5+
Signed-off-by: Pavel Begunkov
Signed-off-by: Jens Axboe
Signed-off-by: Sasha Levin

Pavel Begunkov
2021-01-20 01:27:23 +0800
85958f60e ext4: don't leak old mountpoint samples ... Browse Code »

[ Upstream commit 5a3b590d4b2db187faa6f06adc9a53d6199fb1f9 ]

When the first file is opened, ext4 samples the mountpoint of the
filesystem in 64 bytes of the super block. It does so using
strlcpy(), this means that the remaining bytes in the super block
string buffer are untouched. If the mount point before had a longer
path than the current one, it can be reconstructed.

Consider the case where the fs was mounted to "/media/johnjdeveloper"
and later to "/". The super block buffer then contains
"/\x00edia/johnjdeveloper".

This case was seen in the wild and caused confusion how the name
of a developer ands up on the super block of a filesystem used
in production...

Fix this by using strncpy() instead of strlcpy(). The superblock
field is defined to be a fixed-size char array, and it is already
marked using __nonstring in fs/ext4/ext4.h. The consumer of the field
in e2fsprogs already assumes that in the case of a 64+ byte mount
path, that s_last_mounted will not be NUL terminated.

Link: https://lore.kernel.org/r/X9ujIOJG/HqMr88R@mit.edu
Reported-by: Richard Weinberger
Signed-off-by: Theodore Ts'o
Cc: stable@kernel.org
Signed-off-by: Sasha Levin

Theodore Ts'o
2021-01-20 01:27:22 +0800
41b5ec745 btrfs: tree-checker: check if chunk item end overflows ... Browse Code »

[ Upstream commit 347fb0cfc9bab5195c6701e62eda488310d7938f ]

While mounting a crafted image provided by user, kernel panics due to
the invalid chunk item whose end is less than start.

[66.387422] loop: module loaded
[66.389773] loop0: detected capacity change from 262144 to 0
[66.427708] BTRFS: device fsid a62e00e8-e94e-4200-8217-12444de93c2e devid 1 transid 12 /dev/loop0 scanned by mount (613)
[66.431061] BTRFS info (device loop0): disk space caching is enabled
[66.431078] BTRFS info (device loop0): has skinny extents
[66.437101] BTRFS error: insert state: end < start 29360127 37748736
[66.437136] ------------[ cut here ]------------
[66.437140] WARNING: CPU: 16 PID: 613 at fs/btrfs/extent_io.c:557 insert_state.cold+0x1a/0x46 [btrfs]
[66.437369] CPU: 16 PID: 613 Comm: mount Tainted: G O 5.11.0-rc1-custom #45
[66.437374] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[66.437378] RIP: 0010:insert_state.cold+0x1a/0x46 [btrfs]
[66.437420] RSP: 0018:ffff93e5414c3908 EFLAGS: 00010286
[66.437427] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
[66.437431] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
[66.437434] RBP: ffff93e5414c3938 R08: 0000000000000001 R09: 0000000000000001
[66.437438] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d72aa0
[66.437441] R13: ffff8ec78bc71628 R14: 0000000000000000 R15: 0000000002400000
[66.437447] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
[66.437451] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66.437455] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
[66.437460] PKRU: 55555554
[66.437464] Call Trace:
[66.437475] set_extent_bit+0x652/0x740 [btrfs]
[66.437539] set_extent_bits_nowait+0x1d/0x20 [btrfs]
[66.437576] add_extent_mapping+0x1e0/0x2f0 [btrfs]
[66.437621] read_one_chunk+0x33c/0x420 [btrfs]
[66.437674] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
[66.437708] ? kvm_sched_clock_read+0x18/0x40
[66.437739] open_ctree+0xb32/0x1734 [btrfs]
[66.437781] ? bdi_register_va+0x1b/0x20
[66.437788] ? super_setup_bdi_name+0x79/0xd0
[66.437810] btrfs_mount_root.cold+0x12/0xeb [btrfs]
[66.437854] ? __kmalloc_track_caller+0x217/0x3b0
[66.437873] legacy_get_tree+0x34/0x60
[66.437880] vfs_get_tree+0x2d/0xc0
[66.437888] vfs_kern_mount.part.0+0x78/0xc0
[66.437897] vfs_kern_mount+0x13/0x20
[66.437902] btrfs_mount+0x11f/0x3c0 [btrfs]
[66.437940] ? kfree+0x5ff/0x670
[66.437944] ? __kmalloc_track_caller+0x217/0x3b0
[66.437962] legacy_get_tree+0x34/0x60
[66.437974] vfs_get_tree+0x2d/0xc0
[66.437983] path_mount+0x48c/0xd30
[66.437998] __x64_sys_mount+0x108/0x140
[66.438011] do_syscall_64+0x38/0x50
[66.438018] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[66.438023] RIP: 0033:0x7f0138827f6e
[66.438033] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[66.438040] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
[66.438044] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
[66.438047] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
[66.438050] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[66.438054] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
[66.438078] irq event stamp: 18169
[66.438082] hardirqs last enabled at (18175): [] console_unlock+0x4ff/0x5f0
[66.438088] hardirqs last disabled at (18180): [] console_unlock+0x467/0x5f0
[66.438092] softirqs last enabled at (16910): [] asm_call_irq_on_stack+0x12/0x20
[66.438097] softirqs last disabled at (16905): [] asm_call_irq_on_stack+0x12/0x20
[66.438103] ---[ end trace e114b111db64298b ]---
[66.438107] BTRFS error: found node 12582912 29360127 on insert of 37748736 29360127
[66.438127] BTRFS critical: panic in extent_io_tree_panic:679: locking error: extent tree was modified by another thread while locked (errno=-17 Object already exists)
[66.441069] ------------[ cut here ]------------
[66.441072] kernel BUG at fs/btrfs/extent_io.c:679!
[66.442064] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[66.443018] CPU: 16 PID: 613 Comm: mount Tainted: G W O 5.11.0-rc1-custom #45
[66.444538] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ArchLinux 1.14.0-1 04/01/2014
[66.446223] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
[66.450878] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
[66.451840] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
[66.453141] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
[66.454445] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
[66.455743] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
[66.457055] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
[66.458356] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
[66.459841] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[66.460895] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
[66.462196] PKRU: 55555554
[66.462692] Call Trace:
[66.463139] set_extent_bit.cold+0x30/0x98 [btrfs]
[66.464049] set_extent_bits_nowait+0x1d/0x20 [btrfs]
[66.490466] add_extent_mapping+0x1e0/0x2f0 [btrfs]
[66.514097] read_one_chunk+0x33c/0x420 [btrfs]
[66.534976] btrfs_read_chunk_tree+0x6a4/0x870 [btrfs]
[66.555718] ? kvm_sched_clock_read+0x18/0x40
[66.575758] open_ctree+0xb32/0x1734 [btrfs]
[66.595272] ? bdi_register_va+0x1b/0x20
[66.614638] ? super_setup_bdi_name+0x79/0xd0
[66.633809] btrfs_mount_root.cold+0x12/0xeb [btrfs]
[66.652938] ? __kmalloc_track_caller+0x217/0x3b0
[66.671925] legacy_get_tree+0x34/0x60
[66.690300] vfs_get_tree+0x2d/0xc0
[66.708221] vfs_kern_mount.part.0+0x78/0xc0
[66.725808] vfs_kern_mount+0x13/0x20
[66.742730] btrfs_mount+0x11f/0x3c0 [btrfs]
[66.759350] ? kfree+0x5ff/0x670
[66.775441] ? __kmalloc_track_caller+0x217/0x3b0
[66.791750] legacy_get_tree+0x34/0x60
[66.807494] vfs_get_tree+0x2d/0xc0
[66.823349] path_mount+0x48c/0xd30
[66.838753] __x64_sys_mount+0x108/0x140
[66.854412] do_syscall_64+0x38/0x50
[66.869673] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[66.885093] RIP: 0033:0x7f0138827f6e
[66.945613] RSP: 002b:00007ffecd79edf8 EFLAGS: 00000246 ORIG_RAX: 00000000000000a5
[66.977214] RAX: ffffffffffffffda RBX: 00007f013894c264 RCX: 00007f0138827f6e
[66.994266] RDX: 00005593a4a41360 RSI: 00005593a4a33690 RDI: 00005593a4a3a6c0
[67.011544] RBP: 00005593a4a33440 R08: 0000000000000000 R09: 0000000000000001
[67.028836] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[67.045812] R13: 00005593a4a3a6c0 R14: 00005593a4a41360 R15: 00005593a4a33440
[67.216138] ---[ end trace e114b111db64298c ]---
[67.237089] RIP: 0010:extent_io_tree_panic.isra.0+0x23/0x25 [btrfs]
[67.325317] RSP: 0018:ffff93e5414c3948 EFLAGS: 00010246
[67.347946] RAX: 0000000000000000 RBX: 0000000001bfffff RCX: 0000000000000000
[67.371343] RDX: 0000000000000000 RSI: ffffffffb90d4660 RDI: 00000000ffffffff
[67.394757] RBP: ffff93e5414c3948 R08: 0000000000000001 R09: 0000000000000001
[67.418409] R10: ffff93e5414c3658 R11: 0000000000000000 R12: ffff8ec782d728c0
[67.441906] R13: ffff8ec78bc71628 R14: ffff8ec782d72aa0 R15: 0000000002400000
[67.465436] FS: 00007f01386a8580(0000) GS:ffff8ec809000000(0000) knlGS:0000000000000000
[67.511660] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[67.535047] CR2: 00007f01382fa000 CR3: 0000000109a34000 CR4: 0000000000750ee0
[67.558449] PKRU: 55555554
[67.581146] note: mount[613] exited with preempt_count 2

The image has a chunk item which has a logical start 37748736 and length
18446744073701163008 (-8M). The calculated end 29360127 overflows.
EEXIST was caught by insert_state() because of the duplicate end and
extent_io_tree_panic() was called.

Add overflow check of chunk item end to tree checker so it can be
detected early at mount time.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain
Signed-off-by: Su Yue
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Su Yue
2021-01-20 01:27:22 +0800
531c88c9f cifs: fix interrupted close commands ... Browse Code »

commit 2659d3bff3e1b000f49907d0839178b101a89887 upstream.

Retry close command if it gets interrupted to not leak open handles on
the server.

Signed-off-by: Paulo Alcantara (SUSE)
Reported-by: Duncan Findlay
Suggested-by: Pavel Shilovsky
Fixes: 6988a619f5b7 ("cifs: allow syscalls to be restarted in __smb_send_rqst()")
Cc: stable@vger.kernel.org
Reviewd-by: Pavel Shilovsky
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Paulo Alcantara
2021-01-20 01:27:19 +0800
0e4c42cb4 cifs: check pointer before freeing ... Browse Code »

commit 77b6ec01c29aade01701aa30bf1469acc7f2be76 upstream.

clang static analysis reports this problem

dfs_cache.c:591:2: warning: Argument to kfree() is a constant address
(18446744073709551614), which is not memory allocated by malloc()
kfree(vi);
^~~~~~~~~

In dfs_cache_del_vol() the volume info pointer 'vi' being freed
is the return of a call to find_vol(). The large constant address
is find_vol() returning an error.

Add an error check to dfs_cache_del_vol() similar to the one done
in dfs_cache_update_vol().

Fixes: 54be1f6c1c37 ("cifs: Add DFS cache routines")
Signed-off-by: Tom Rix
Reviewed-by: Nathan Chancellor
CC: # v5.0+
Signed-off-by: Steve French
Signed-off-by: Greg Kroah-Hartman

Tom Rix
2021-01-20 01:27:19 +0800
2207c3ce7 ext4: fix bug for rename with RENAME_WHITEOUT ... Browse Code »

commit 6b4b8e6b4ad8553660421d6360678b3811d5deb9 upstream.

We got a "deleted inode referenced" warning cross our fsstress test. The
bug can be reproduced easily with following steps:

cd /dev/shm
mkdir test/
fallocate -l 128M img
mkfs.ext4 -b 1024 img
mount img test/
dd if=/dev/zero of=test/foo bs=1M count=128
mkdir test/dir/ && cd test/dir/
for ((i=0;i
Signed-off-by: Greg Kroah-Hartman

yangerkun
2021-01-20 01:27:19 +0800
15a062c79 ext4: fix wrong list_splice in ext4_fc_cleanup ... Browse Code »

commit 31e203e09f036f48e7c567c2d32df0196bbd303f upstream.

After full/fast commit, entries in staging queue are promoted to main
queue. In ext4_fs_cleanup function, it splice to staging queue to
staging queue.

Fixes: aa75f4d3daaeb ("ext4: main fast-commit commit path")
Signed-off-by: Daejun Park
Reviewed-by: Harshad Shirwadkar
Link: https://lore.kernel.org/r/20201230094851epcms2p6eeead8cc984379b37b2efd21af90fd1a@epcms2p6
Signed-off-by: Theodore Ts'o
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman

Daejun Park
2021-01-20 01:27:19 +0800
6c557cb1f ext4: use IS_ERR instead of IS_ERR_OR_NULL and set inode null when IS_ERR ... Browse Code »

commit 23dd561ad9eae02b4d51bb502fe4e1a0666e9567 upstream.

1: ext4_iget/ext4_find_extent never returns NULL, use IS_ERR
instead of IS_ERR_OR_NULL to fix this.

2: ext4_fc_replay_inode should set the inode to NULL when IS_ERR.
and go to call iput properly.

Fixes: 8016e29f4362 ("ext4: fast commit recovery path")
Signed-off-by: Yi Li
Reviewed-by: Jan Kara
Link: https://lore.kernel.org/r/20201230033827.3996064-1-yili@winhong.com
Signed-off-by: Theodore Ts'o
Cc: stable@kernel.org
Signed-off-by: Greg Kroah-Hartman

Yi Li
2021-01-20 01:27:19 +0800
f37fba66a btrfs: prevent NULL pointer dereference in extent_io_tree_panic ... Browse Code »

commit 29b665cc51e8b602bf2a275734349494776e3dbc upstream.

Some extent io trees are initialized with NULL private member (e.g.
btrfs_device::alloc_state and btrfs_fs_info::excluded_extents).
Dereference of a NULL tree->private as inode pointer will cause panic.

Pass tree->fs_info as it's known to be valid in all cases.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=208929
Fixes: 05912a3c04eb ("btrfs: drop extent_io_ops::tree_fs_info callback")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Anand Jain
Signed-off-by: Su Yue
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Su Yue
2021-01-20 01:27:17 +0800
e883eb5d1 btrfs: reloc: fix wrong file extent type check to avoid false ENOENT ... Browse Code »

commit 50e31ef486afe60f128d42fb9620e2a63172c15c upstream.

[BUG]
There are several bug reports about recent kernel unable to relocate
certain data block groups.

Sometimes the error just goes away, but there is one reporter who can
reproduce it reliably.

The dmesg would look like:

[438.260483] BTRFS info (device dm-10): balance: start -dvrange=34625344765952..34625344765953
[438.269018] BTRFS info (device dm-10): relocating block group 34625344765952 flags data|raid1
[450.439609] BTRFS info (device dm-10): found 167 extents, stage: move data extents
[463.501781] BTRFS info (device dm-10): balance: ended with status: -2

[CAUSE]
The ENOENT error is returned from the following call chain:

add_data_references()
|- delete_v1_space_cache();
|- if (!found)
return -ENOENT;

The variable @found is set to true if we find a data extent whose
disk bytenr matches parameter @data_bytes.

With extra debugging, the offending tree block looks like this:

leaf bytenr = 42676709441536, data_bytenr = 34626327621632

ctime 1567904822.739884119 (2019-09-08 03:07:02)
mtime 0.0 (1970-01-01 01:00:00)
otime 0.0 (1970-01-01 01:00:00)
item 27 key (51933 EXTENT_DATA 0) itemoff 9854 itemsize 53
generation 1517381 type 2 (prealloc)
prealloc data disk byte 34626327621632 nr 262144 <<<
prealloc data offset 0 nr 262144
item 28 key (52262 ROOT_ITEM 0) itemoff 9415 itemsize 439
generation 2618893 root_dirid 256 bytenr 42677048360960 level 3 refs 1
lastsnap 2618893 byte_limit 0 bytes_used 5557338112 flags 0x0(none)
uuid d0d4361f-d231-6d40-8901-fe506e4b2b53

Although item 27 has disk bytenr 34626327621632, which matches the
data_bytenr, its type is prealloc, not reg.
This makes the existing code skip that item, and return ENOENT.

[FIX]
The code is modified in commit 19b546d7a1b2 ("btrfs: relocation: Use
btrfs_find_all_leafs to locate data extent parent tree leaves"), before
that commit, we use something like

"if (type == BTRFS_FILE_EXTENT_INLINE) continue;"

But in that offending commit, we use (type == BTRFS_FILE_EXTENT_REG),
ignoring BTRFS_FILE_EXTENT_PREALLOC.

Fix it by also checking BTRFS_FILE_EXTENT_PREALLOC.

Reported-by: Stéphane Lesimple
Link: https://lore.kernel.org/linux-btrfs/505cabfa88575ed6dbe7cb922d8914fb@lesimple.fr
Fixes: 19b546d7a1b2 ("btrfs: relocation: Use btrfs_find_all_leafs to locate data extent parent tree leaves")
CC: stable@vger.kernel.org # 5.6+
Tested-By: Stéphane Lesimple
Reviewed-by: Su Yue
Signed-off-by: Qu Wenruo
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Greg Kroah-Hartman

Qu Wenruo
2021-01-20 01:27:17 +0800

19 Jan, 2021

1 commit

f11e1751a Merge 5.10.8 into android12-5.10 ... Browse Code »

Changes in 5.10.8
powerpc/32s: Fix RTAS machine check with VMAP stack
io_uring: synchronise IOPOLL on task_submit fail
io_uring: limit {io|sq}poll submit locking scope
io_uring: patch up IOPOLL overflow_flush sync
RDMA/hns: Avoid filling sl in high 3 bits of vlan_id
iommu/arm-smmu-qcom: Initialize SCTLR of the bypass context
drm/panfrost: Don't corrupt the queue mutex on open/close
io_uring: Fix return value from alloc_fixed_file_ref_node
scsi: ufs: Fix -Wsometimes-uninitialized warning
btrfs: skip unnecessary searches for xattrs when logging an inode
btrfs: fix deadlock when cloning inline extent and low on free metadata space
btrfs: shrink delalloc pages instead of full inodes
net: cdc_ncm: correct overhead in delayed_ndp_size
net: hns3: fix incorrect handling of sctp6 rss tuple
net: hns3: fix the number of queues actually used by ARQ
net: hns3: fix a phy loopback fail issue
net: stmmac: dwmac-sun8i: Fix probe error handling
net: stmmac: dwmac-sun8i: Balance internal PHY resource references
net: stmmac: dwmac-sun8i: Balance internal PHY power
net: stmmac: dwmac-sun8i: Balance syscon (de)initialization
net: vlan: avoid leaks on register_vlan_dev() failures
net/sonic: Fix some resource leaks in error handling paths
net: bareudp: add missing error handling for bareudp_link_config()
ptp: ptp_ines: prevent build when HAS_IOMEM is not set
net: ipv6: fib: flush exceptions when purging route
tools: selftests: add test for changing routes with PTMU exceptions
net: fix pmtu check in nopmtudisc mode
net: ip: always refragment ip defragmented packets
chtls: Fix hardware tid leak
chtls: Remove invalid set_tcb call
chtls: Fix panic when route to peer not configured
chtls: Avoid unnecessary freeing of oreq pointer
chtls: Replace skb_dequeue with skb_peek
chtls: Added a check to avoid NULL pointer dereference
chtls: Fix chtls resources release sequence
octeontx2-af: fix memory leak of lmac and lmac->name
nexthop: Fix off-by-one error in error path
nexthop: Unlink nexthop group entry in error path
nexthop: Bounce NHA_GATEWAY in FDB nexthop groups
s390/qeth: fix deadlock during recovery
s390/qeth: fix locking for discipline setup / removal
s390/qeth: fix L2 header access in qeth_l3_osa_features_check()
net: dsa: lantiq_gswip: Exclude RMII from modes that report 1 GbE
net/mlx5: Use port_num 1 instead of 0 when delete a RoCE address
net/mlx5e: ethtool, Fix restriction of autoneg with 56G
net/mlx5e: In skb build skip setting mark in switchdev mode
net/mlx5: Check if lag is supported before creating one
scsi: lpfc: Fix variable 'vport' set but not used in lpfc_sli4_abts_err_handler()
ionic: start queues before announcing link up
HID: wacom: Fix memory leakage caused by kfifo_alloc
fanotify: Fix sys_fanotify_mark() on native x86-32
ARM: OMAP2+: omap_device: fix idling of devices during probe
i2c: sprd: use a specific timeout to avoid system hang up issue
dmaengine: dw-edma: Fix use after free in dw_edma_alloc_chunk()
selftests/bpf: Clarify build error if no vmlinux
can: tcan4x5x: fix bittiming const, use common bittiming from m_can driver
can: m_can: m_can_class_unregister(): remove erroneous m_can_clk_stop()
can: kvaser_pciefd: select CONFIG_CRC32
spi: spi-geni-qcom: Fail new xfers if xfer/cancel/abort pending
cpufreq: powernow-k8: pass policy rather than use cpufreq_cpu_get()
spi: spi-geni-qcom: Fix geni_spi_isr() NULL dereference in timeout case
spi: stm32: FIFO threshold level - fix align packet size
i2c: i801: Fix the i2c-mux gpiod_lookup_table not being properly terminated
i2c: mediatek: Fix apdma and i2c hand-shake timeout
bcache: set bcache device into read-only mode for BCH_FEATURE_INCOMPAT_OBSO_LARGE_BUCKET
interconnect: imx: Add a missing of_node_put after of_device_is_available
interconnect: qcom: fix rpmh link failures
dmaengine: mediatek: mtk-hsdma: Fix a resource leak in the error handling path of the probe function
dmaengine: milbeaut-xdmac: Fix a resource leak in the error handling path of the probe function
dmaengine: xilinx_dma: check dma_async_device_register return value
dmaengine: xilinx_dma: fix incompatible param warning in _child_probe()
dmaengine: xilinx_dma: fix mixed_enum_type coverity warning
arm64: mm: Fix ARCH_LOW_ADDRESS_LIMIT when !CONFIG_ZONE_DMA
qed: select CONFIG_CRC32
phy: dp83640: select CONFIG_CRC32
wil6210: select CONFIG_CRC32
block: rsxx: select CONFIG_CRC32
lightnvm: select CONFIG_CRC32
zonefs: select CONFIG_CRC32
iommu/vt-d: Fix misuse of ALIGN in qi_flush_piotlb()
iommu/intel: Fix memleak in intel_irq_remapping_alloc
bpftool: Fix compilation failure for net.o with older glibc
nvme-tcp: Fix possible race of io_work and direct send
net/mlx5e: Fix memleak in mlx5e_create_l2_table_groups
net/mlx5e: Fix two double free cases
regmap: debugfs: Fix a memory leak when calling regmap_attach_dev
wan: ds26522: select CONFIG_BITREVERSE
arm64: cpufeature: remove non-exist CONFIG_KVM_ARM_HOST
regulator: qcom-rpmh-regulator: correct hfsmps515 definition
net: mvpp2: disable force link UP during port init procedure
drm/i915/dp: Track pm_qos per connector
net: mvneta: fix error message when MTU too large for XDP
selftests: fib_nexthops: Fix wrong mausezahn invocation
KVM: arm64: Don't access PMCR_EL0 when no PMU is available
xsk: Fix race in SKB mode transmit with shared cq
xsk: Rollback reservation at NETDEV_TX_BUSY
block/rnbd-clt: avoid module unload race with close confirmation
can: isotp: isotp_getname(): fix kernel information leak
block: fix use-after-free in disk_part_iter_next
net: drop bogus skb with CHECKSUM_PARTIAL and offset beyond end of trimmed packet
regmap: debugfs: Fix a reversed if statement in regmap_debugfs_init()
drm/panfrost: Remove unused variables in panfrost_job_close()
tools headers UAPI: Sync linux/fscrypt.h with the kernel sources
Linux 5.10.8

Signed-off-by: Greg Kroah-Hartman
Change-Id: Ib8272ec9f47a3c3813509bcacece3b16137332e1

Greg Kroah-Hartman
2021-01-19 16:33:21 +0800

17 Jan, 2021

5 commits

2bbe923d7 zonefs: select CONFIG_CRC32 ... Browse Code »

commit 4f8b848788f77c7f5c3bd98febce66b7aa14785f upstream.

When CRC32 is disabled, zonefs cannot be linked:

ld: fs/zonefs/super.o: in function `zonefs_fill_super':

Add a Kconfig 'select' statement for it.

Fixes: 8dcc1a9d90c1 ("fs: New zonefs file system")
Signed-off-by: Arnd Bergmann
Signed-off-by: Damien Le Moal
Signed-off-by: Greg Kroah-Hartman

Arnd Bergmann
2021-01-17 21:17:03 +0800
797335659 fanotify: Fix sys_fanotify_mark() on native x86-32 ... Browse Code »

commit 2ca408d9c749c32288bc28725f9f12ba30299e8f upstream.

Commit

121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")

converted native x86-32 which take 64-bit arguments to use the
compat handlers to allow conversion to passing args via pt_regs.
sys_fanotify_mark() was however missed, as it has a general compat
handler. Add a config option that will use the syscall wrapper that
takes the split args for native 32-bit.

[ bp: Fix typo in Kconfig help text. ]

Fixes: 121b32a58a3a ("x86/entry/32: Use IA32-specific wrappers for syscalls taking 64-bit arguments")
Reported-by: Paweł Jasiak
Signed-off-by: Brian Gerst
Signed-off-by: Borislav Petkov
Acked-by: Jan Kara
Acked-by: Andy Lutomirski
Link: https://lkml.kernel.org/r/20201130223059.101286-1-brgerst@gmail.com
Signed-off-by: Greg Kroah-Hartman

Brian Gerst
2021-01-17 21:16:59 +0800
e3b5252b5 btrfs: shrink delalloc pages instead of full inodes ... Browse Code »

[ Upstream commit e076ab2a2ca70a0270232067cd49f76cd92efe64 ]

Commit 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in
shrink_delalloc") cleaned up how we do delalloc shrinking by utilizing
some infrastructure we have in place to flush inodes that we use for
device replace and snapshot. However this introduced a pretty serious
performance regression. To reproduce the user untarred the source
tarball of Firefox (360MiB xz compressed/1.5GiB uncompressed), and would
see it take anywhere from 5 to 20 times as long to untar in 5.10
compared to 5.9. This was observed on fast devices (SSD and better) and
not on HDD.

The root cause is because before we would generally use the normal
writeback path to reclaim delalloc space, and for this we would provide
it with the number of pages we wanted to flush. The referenced commit
changed this to flush that many inodes, which drastically increased the
amount of space we were flushing in certain cases, which severely
affected performance.

We cannot revert this patch unfortunately because of 3d45f221ce62
("btrfs: fix deadlock when cloning inline extent and low on free
metadata space") which requires the ability to skip flushing inodes that
are being cloned in certain scenarios, which means we need to keep using
our flushing infrastructure or risk re-introducing the deadlock.

Instead to fix this problem we can go back to providing
btrfs_start_delalloc_roots with a number of pages to flush, and then set
up a writeback_control and utilize sync_inode() to handle the flushing
for us. This gives us the same behavior we had prior to the fix, while
still allowing us to avoid the deadlock that was fixed by Filipe. I
redid the users original test and got the following results on one of
our test machines (256GiB of ram, 56 cores, 2TiB Intel NVMe drive)

5.9 0m54.258s
5.10 1m26.212s
5.10+patch 0m38.800s

5.10+patch is significantly faster than plain 5.9 because of my patch
series "Change data reservations to use the ticketing infra" which
contained the patch that introduced the regression, but generally
improved the overall ENOSPC flushing mechanisms.

Additional testing on consumer-grade SSD (8GiB ram, 8 CPU) confirm
the results:

5.10.5 4m00s
5.10.5+patch 1m08s
5.11-rc2 5m14s
5.11-rc2+patch 1m30s

Reported-by: René Rebe
Fixes: 38d715f494f2 ("btrfs: use btrfs_start_delalloc_roots in shrink_delalloc")
CC: stable@vger.kernel.org # 5.10
Signed-off-by: Josef Bacik
Tested-by: David Sterba
Reviewed-by: David Sterba
[ add my test results ]
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Josef Bacik
2021-01-17 21:16:54 +0800
17243f73a btrfs: fix deadlock when cloning inline extent and low on free metadata space ... Browse Code »

[ Upstream commit 3d45f221ce627d13e2e6ef3274f06750c84a6542 ]

When cloning an inline extent there are cases where we can not just copy
the inline extent from the source range to the target range (e.g. when the
target range starts at an offset greater than zero). In such cases we copy
the inline extent's data into a page of the destination inode and then
dirty that page. However, after that we will need to start a transaction
for each processed extent and, if we are ever low on available metadata
space, we may need to flush existing delalloc for all dirty inodes in an
attempt to release metadata space - if that happens we may deadlock:

* the async reclaim task queued a delalloc work to flush delalloc for
the destination inode of the clone operation;

* the task executing that delalloc work gets blocked waiting for the
range with the dirty page to be unlocked, which is currently locked
by the task doing the clone operation;

* the async reclaim task blocks waiting for the delalloc work to complete;

* the cloning task is waiting on the waitqueue of its reservation ticket
while holding the range with the dirty page locked in the inode's
io_tree;

* if metadata space is not released by some other task (like delalloc for
some other inode completing for example), the clone task waits forever
and as a consequence the delalloc work and async reclaim tasks will hang
forever as well. Releasing more space on the other hand may require
starting a transaction, which will hang as well when trying to reserve
metadata space, resulting in a deadlock between all these tasks.

When this happens, traces like the following show up in dmesg/syslog:

[87452.323003] INFO: task kworker/u16:11:1810830 blocked for more than 120 seconds.
[87452.323644] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.324248] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.324852] task:kworker/u16:11 state:D stack: 0 pid:1810830 ppid: 2 flags:0x00004000
[87452.325520] Workqueue: btrfs-flush_delalloc btrfs_work_helper [btrfs]
[87452.326136] Call Trace:
[87452.326737] __schedule+0x5d1/0xcf0
[87452.327390] schedule+0x45/0xe0
[87452.328174] lock_extent_bits+0x1e6/0x2d0 [btrfs]
[87452.328894] ? finish_wait+0x90/0x90
[87452.329474] btrfs_invalidatepage+0x32c/0x390 [btrfs]
[87452.330133] ? __mod_memcg_state+0x8e/0x160
[87452.330738] __extent_writepage+0x2d4/0x400 [btrfs]
[87452.331405] extent_write_cache_pages+0x2b2/0x500 [btrfs]
[87452.332007] ? lock_release+0x20e/0x4c0
[87452.332557] ? trace_hardirqs_on+0x1b/0xf0
[87452.333127] extent_writepages+0x43/0x90 [btrfs]
[87452.333653] ? lock_acquire+0x1a3/0x490
[87452.334177] do_writepages+0x43/0xe0
[87452.334699] ? __filemap_fdatawrite_range+0xa4/0x100
[87452.335720] __filemap_fdatawrite_range+0xc5/0x100
[87452.336500] btrfs_run_delalloc_work+0x17/0x40 [btrfs]
[87452.337216] btrfs_work_helper+0xf1/0x600 [btrfs]
[87452.337838] process_one_work+0x24e/0x5e0
[87452.338437] worker_thread+0x50/0x3b0
[87452.339137] ? process_one_work+0x5e0/0x5e0
[87452.339884] kthread+0x153/0x170
[87452.340507] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.341153] ret_from_fork+0x22/0x30
[87452.341806] INFO: task kworker/u16:1:2426217 blocked for more than 120 seconds.
[87452.342487] Tainted: G B W 5.10.0-rc4-btrfs-next-73 #1
[87452.343274] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[87452.344049] task:kworker/u16:1 state:D stack: 0 pid:2426217 ppid: 2 flags:0x00004000
[87452.344974] Workqueue: events_unbound btrfs_async_reclaim_metadata_space [btrfs]
[87452.345655] Call Trace:
[87452.346305] __schedule+0x5d1/0xcf0
[87452.346947] ? kvm_clock_read+0x14/0x30
[87452.347676] ? wait_for_completion+0x81/0x110
[87452.348389] schedule+0x45/0xe0
[87452.349077] schedule_timeout+0x30c/0x580
[87452.349718] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[87452.350340] ? lock_acquire+0x1a3/0x490
[87452.351006] ? try_to_wake_up+0x7a/0xa20
[87452.351541] ? lock_release+0x20e/0x4c0
[87452.352040] ? lock_acquired+0x199/0x490
[87452.352517] ? wait_for_completion+0x81/0x110
[87452.353000] wait_for_completion+0xab/0x110
[87452.353490] start_delalloc_inodes+0x2af/0x390 [btrfs]
[87452.353973] btrfs_start_delalloc_roots+0x12d/0x250 [btrfs]
[87452.354455] flush_space+0x24f/0x660 [btrfs]
[87452.355063] btrfs_async_reclaim_metadata_space+0x1bb/0x480 [btrfs]
[87452.355565] process_one_work+0x24e/0x5e0
[87452.356024] worker_thread+0x20f/0x3b0
[87452.356487] ? process_one_work+0x5e0/0x5e0
[87452.356973] kthread+0x153/0x170
[87452.357434] ? kthread_mod_delayed_work+0xc0/0xc0
[87452.357880] ret_from_fork+0x22/0x30
(...)
< stack traces of several tasks waiting for the locks of the inodes of the
clone operation >
(...)
[92867.444138] RSP: 002b:00007ffc3371bbe8 EFLAGS: 00000246 ORIG_RAX: 0000000000000052
[92867.444624] RAX: ffffffffffffffda RBX: 00007ffc3371bea0 RCX: 00007f61efe73f97
[92867.445116] RDX: 0000000000000000 RSI: 0000560fbd5d7a40 RDI: 0000560fbd5d8960
[92867.445595] RBP: 00007ffc3371beb0 R08: 0000000000000001 R09: 0000000000000003
[92867.446070] R10: 00007ffc3371b996 R11: 0000000000000246 R12: 0000000000000000
[92867.446820] R13: 000000000000001f R14: 00007ffc3371bea0 R15: 00007ffc3371beb0
[92867.447361] task:fsstress state:D stack: 0 pid:2508238 ppid:2508153 flags:0x00004000
[92867.447920] Call Trace:
[92867.448435] __schedule+0x5d1/0xcf0
[92867.448934] ? _raw_spin_unlock_irqrestore+0x3c/0x60
[92867.449423] schedule+0x45/0xe0
[92867.449916] __reserve_bytes+0x4a4/0xb10 [btrfs]
[92867.450576] ? finish_wait+0x90/0x90
[92867.451202] btrfs_reserve_metadata_bytes+0x29/0x190 [btrfs]
[92867.451815] btrfs_block_rsv_add+0x1f/0x50 [btrfs]
[92867.452412] start_transaction+0x2d1/0x760 [btrfs]
[92867.453216] clone_copy_inline_extent+0x333/0x490 [btrfs]
[92867.453848] ? lock_release+0x20e/0x4c0
[92867.454539] ? btrfs_search_slot+0x9a7/0xc30 [btrfs]
[92867.455218] btrfs_clone+0x569/0x7e0 [btrfs]
[92867.455952] btrfs_clone_files+0xf6/0x150 [btrfs]
[92867.456588] btrfs_remap_file_range+0x324/0x3d0 [btrfs]
[92867.457213] do_clone_file_range+0xd4/0x1f0
[92867.457828] vfs_clone_file_range+0x4d/0x230
[92867.458355] ? lock_release+0x20e/0x4c0
[92867.458890] ioctl_file_clone+0x8f/0xc0
[92867.459377] do_vfs_ioctl+0x342/0x750
[92867.459913] __x64_sys_ioctl+0x62/0xb0
[92867.460377] do_syscall_64+0x33/0x80
[92867.460842] entry_SYSCALL_64_after_hwframe+0x44/0xa9
(...)
< stack traces of more tasks blocked on metadata reservation like the clone
task above, because the async reclaim task has deadlocked >
(...)

Another thing to notice is that the worker task that is deadlocked when
trying to flush the destination inode of the clone operation is at
btrfs_invalidatepage(). This is simply because the clone operation has a
destination offset greater than the i_size and we only update the i_size
of the destination file after cloning an extent (just like we do in the
buffered write path).

Since the async reclaim path uses btrfs_start_delalloc_roots() to trigger
the flushing of delalloc for all inodes that have delalloc, add a runtime
flag to an inode to signal it should not be flushed, and for inodes with
that flag set, start_delalloc_inodes() will simply skip them. When the
cloning code needs to dirty a page to copy an inline extent, set that flag
on the inode and then clear it when the clone operation finishes.

This could be sporadically triggered with test case generic/269 from
fstests, which exercises many fsstress processes running in parallel with
several dd processes filling up the entire filesystem.

CC: stable@vger.kernel.org # 5.9+
Fixes: 05a5a7621ce6 ("Btrfs: implement full reflink support for inline extents")
Reviewed-by: Josef Bacik
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Filipe Manana
2021-01-17 21:16:54 +0800
877381645 btrfs: skip unnecessary searches for xattrs when logging an inode ... Browse Code »

[ Upstream commit f2f121ab500d0457cc9c6f54269d21ffdf5bd304 ]

Every time we log an inode we lookup in the fs/subvol tree for xattrs and
if we have any, log them into the log tree. However it is very common to
have inodes without any xattrs, so doing the search wastes times, but more
importantly it adds contention on the fs/subvol tree locks, either making
the logging code block and wait for tree locks or making the logging code
making other concurrent operations block and wait.

The most typical use cases where xattrs are used are when capabilities or
ACLs are defined for an inode, or when SELinux is enabled.

This change makes the logging code detect when an inode does not have
xattrs and skip the xattrs search the next time the inode is logged,
unless the inode is evicted and loaded again or a xattr is added to the
inode. Therefore skipping the search for xattrs on inodes that don't ever
have xattrs and are fsynced with some frequency.

The following script that calls dbench was used to measure the impact of
this change on a VM with 8 CPUs, 16Gb of ram, using a raw NVMe device
directly (no intermediary filesystem on the host) and using a non-debug
kernel (default configuration on Debian distributions):

$ cat test.sh
#!/bin/bash

DEV=/dev/sdk
MNT=/mnt/sdk
MOUNT_OPTIONS="-o ssd"

mkfs.btrfs -f -m single -d single $DEV
mount $MOUNT_OPTIONS $DEV $MNT

dbench -D $MNT -t 200 40

umount $MNT

The results before this change:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 5761605 0.172 312.057
Close 4232452 0.002 10.927
Rename 243937 1.406 277.344
Unlink 1163456 0.631 298.402
Deltree 160 11.581 221.107
Mkdir 80 0.003 0.005
Qpathinfo 5221410 0.065 122.309
Qfileinfo 915432 0.001 3.333
Qfsinfo 957555 0.003 3.992
Sfileinfo 469244 0.023 20.494
Find 2018865 0.448 123.659
WriteX 2874851 0.049 118.529
ReadX 9030579 0.004 21.654
LockX 18754 0.003 4.423
UnlockX 18754 0.002 0.331
Flush 403792 10.944 359.494

Throughput 908.444 MB/sec 40 clients 40 procs max_latency=359.500 ms

The results after this change:

Operation Count AvgLat MaxLat
----------------------------------------
NTCreateX 6442521 0.159 230.693
Close 4732357 0.002 10.972
Rename 272809 1.293 227.398
Unlink 1301059 0.563 218.500
Deltree 160 7.796 54.887
Mkdir 80 0.008 0.478
Qpathinfo 5839452 0.047 124.330
Qfileinfo 1023199 0.001 4.996
Qfsinfo 1070760 0.003 5.709
Sfileinfo 524790 0.033 21.765
Find 2257658 0.314 125.611
WriteX 3211520 0.040 232.135
ReadX 10098969 0.004 25.340
LockX 20974 0.003 1.569
UnlockX 20974 0.002 3.475
Flush 451553 10.287 331.037

Throughput 1011.77 MB/sec 40 clients 40 procs max_latency=331.045 ms

+10.8% throughput, -8.2% max latency

Reviewed-by: Josef Bacik
Signed-off-by: Filipe Manana
Reviewed-by: David Sterba
Signed-off-by: David Sterba
Signed-off-by: Sasha Levin

Filipe Manana
2021-01-17 21:16:53 +0800