23 Sep, 2022

1 commit

  • commit 683412ccf61294d727ead4a73d97397396e69a6b upstream.

    Flush the CPU caches when memory is reclaimed from an SEV guest (where
    reclaim also includes it being unmapped from KVM's memslots). Due to lack
    of coherency for SEV encrypted memory, failure to flush results in silent
    data corruption if userspace is malicious/broken and doesn't ensure SEV
    guest memory is properly pinned and unpinned.

    Cache coherency is not enforced across the VM boundary in SEV (AMD APM
    vol.2 Section 15.34.7). Confidential cachelines, generated by confidential
    VM guests, have to be explicitly flushed on the host side. If a memory page
    containing dirty confidential cachelines was released by the VM and
    reallocated to another user, the cachelines may corrupt the new user at a
    later time.

    KVM takes a shortcut by assuming all confidential memory remains pinned
    until the end of the VM lifetime. Therefore, KVM does not flush caches at
    mmu_notifier invalidation events. Because of this incorrect assumption and
    the lack of cache flushing, malicious userspace can crash the host kernel
    by creating a malicious VM that continuously allocates and releases
    unpinned confidential memory pages while the VM is running.

    Add cache flush operations to the mmu_notifier operations to ensure that
    any physical memory leaving the guest VM gets flushed. In particular, hook
    the mmu_notifier_invalidate_range_start and mmu_notifier_release events and
    flush caches accordingly. The hooks run after releasing the mmu lock to
    avoid contention with other vCPUs.
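
    On the SVM side the new arch hook boils down to a full cache writeback
    for SEV guests. A simplified sketch (the exact code in this stable
    branch may differ slightly from upstream):

        void sev_guest_memory_reclaimed(struct kvm *kvm)
        {
                /* Non-SEV guests are cache coherent; nothing to do. */
                if (!sev_guest(kvm))
                        return;

                /* Write back and invalidate caches on all CPUs. */
                wbinvd_on_all_cpus();
        }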

    Cc: stable@vger.kernel.org
    Suggested-by: Sean Christopherson
    Reported-by: Mingwei Zhang
    Signed-off-by: Mingwei Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini
    [OP: adjusted KVM_X86_OP_OPTIONAL() -> KVM_X86_OP_NULL, applied
    kvm_arch_guest_memory_reclaimed() call in kvm_set_memslot()]
    Signed-off-by: Ovidiu Panait
    Signed-off-by: Greg Kroah-Hartman

    Mingwei Zhang
     

25 Aug, 2022

1 commit

  • commit 405294f29faee5de8c10cb9d4a90e229c2835279 upstream.

    Unconditionally get a reference to the /dev/kvm module when creating a VM
    instead of using try_get_module(), which will fail if the module is in
    the process of being forcefully unloaded. The error handling when
    try_get_module() fails doesn't properly unwind all that has been done,
    e.g. doesn't call kvm_arch_pre_destroy_vm() and doesn't remove the VM
    from the global list. Not removing VMs from the global list tends to be
    fatal, e.g. leads to use-after-free explosions.
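
    A diff-style sketch of the change in kvm_create_vm() (simplified; the
    error label is illustrative):

        -       if (!try_module_get(kvm_chardev_ops.owner)) {
        -               r = -ENODEV;
        -               goto out_err;
        -       }
        +       /* Cannot fail; the module is alive because /dev/kvm is open. */
        +       __module_get(kvm_chardev_ops.owner);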

    The obvious alternative would be to add proper unwinding, but the
    justification for using try_get_module(), "rmmod --wait", is completely
    bogus as support for "rmmod --wait", i.e. delete_module() without
    O_NONBLOCK, was removed by commit 3f2b9c9cdf38 ("module: remove rmmod
    --wait option.") nearly a decade ago.

    It's still possible for try_get_module() to fail due to the module dying
    (more like being killed), as the module will be tagged MODULE_STATE_GOING
    by "rmmod --force", i.e. delete_module(..., O_TRUNC), but playing nice
    with forced unloading is an exercise in futility and gives a false sense
    of security. Using try_get_module() only prevents acquiring _new_
    references, it doesn't magically put the references held by other VMs,
    and forced unloading doesn't wait, i.e. "rmmod --force" on KVM is all but
    guaranteed to cause spectacular fireworks; the window where KVM will fail
    try_get_module() is tiny compared to the window where KVM is building and
    running the VM with an elevated module refcount.

    Addressing KVM's inability to play nice with "rmmod --force" is firmly
    out-of-scope. Forcefully unloading any module taints the kernel (for obvious
    reasons) _and_ requires the kernel to be built with
    CONFIG_MODULE_FORCE_UNLOAD=y, which is off by default and comes with the
    amusing disclaimer that it's "mainly for kernel developers and desperate
    users". In other words, KVM is free to scoff at bug reports due to using
    "rmmod --force" while VMs may be running.

    Fixes: 5f6de5cbebee ("KVM: Prevent module exit until all VMs are freed")
    Cc: stable@vger.kernel.org
    Cc: David Matlack
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

17 Aug, 2022

1 commit

  • [ Upstream commit a1040b0d42acf69bb4f6dbdc54c2dcd78eea1de5 ]

    Don't set Accessed/Dirty bits for a struct page with PG_reserved set,
    i.e. don't set A/D bits for the ZERO_PAGE. The ZERO_PAGE (or pages
    depending on the architecture) should obviously never be written, and
    similarly there's no point in marking it accessed as the page will never
    be swapped out or reclaimed. The comment in page-flags.h is quite clear
    that PG_reserved pages should be managed only by their owner, and
    strictly following that mandate also simplifies KVM's logic.
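
    The heart of the change is a helper that gates the A/D updates; a
    simplified sketch:

        static bool kvm_is_ad_tracked_page(struct page *page)
        {
                /*
                 * Per page-flags.h, PG_reserved pages "should in general not
                 * be touched (e.g. set dirty) except by its owner".
                 */
                return !PageReserved(page);
        }

        static void kvm_set_page_dirty(struct page *page)
        {
                if (kvm_is_ad_tracked_page(page))
                        SetPageDirty(page);
        }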

    Fixes: 7df003c85218 ("KVM: fix overflow of zero page refcount with ksm running")
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     

29 Jul, 2022

1 commit

  • commit e8bc2427018826e02add7b0ed0fc625a60390ae5 upstream.

    A KVM device cleanup happens in either of two callbacks:
    1) destroy() which is called when the VM is being destroyed;
    2) release() which is called when a device fd is closed.

    Most KVM devices use 1) but Book3s's interrupt controller KVM devices
    (XICS, XIVE, XIVE-native) use 2) as they need to close and reopen during
    the machine execution. The error handling in kvm_ioctl_create_device()
    assumes destroy() is always defined which leads to NULL dereference as
    discovered by Syzkaller.

    This adds a check for destroy != NULL and adds a missing release().

    This is not changing kvm_destroy_devices() as devices with a defined
    release() should have been removed from the KVM devices list by then.

    Suggested-by: Paolo Bonzini
    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Alexey Kardashevskiy
     

12 Jul, 2022

3 commits

  • [ Upstream commit 5c697c367a66307a5d943c3449421aff2aa3ca4a ]

    Initialize debugfs_entry to its semi-magical -ENOENT value when the VM
    is created. KVM's teardown when VM creation fails is kludgy and calls
    kvm_uevent_notify_change() and kvm_destroy_vm_debugfs() even if KVM never
    attempted kvm_create_vm_debugfs(). Because debugfs_entry is zero
    initialized, the IS_ERR() checks pass and KVM derefs a NULL pointer.

    BUG: kernel NULL pointer dereference, address: 0000000000000018
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 1068b1067 P4D 1068b1067 PUD 1068b0067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 0 PID: 871 Comm: repro Not tainted 5.18.0-rc1+ #825
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
    RIP: 0010:__dentry_path+0x7b/0x130
    Call Trace:

    dentry_path_raw+0x42/0x70
    kvm_uevent_notify_change.part.0+0x10c/0x200 [kvm]
    kvm_put_kvm+0x63/0x2b0 [kvm]
    kvm_dev_ioctl+0x43a/0x920 [kvm]
    __x64_sys_ioctl+0x83/0xb0
    do_syscall_64+0x31/0x50
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Modules linked in: kvm_intel kvm irqbypass
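
    The fix is a one-liner early in kvm_create_vm(), before any failure path
    can run (sketch; surrounding code omitted):

        /* ERR_PTR(-ENOENT) makes the IS_ERR() teardown checks fail safely. */
        kvm->debugfs_dentry = ERR_PTR(-ENOENT);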

    Fixes: a44a4cc1c969 ("KVM: Don't create VM debugfs files outside of the VM directory")
    Cc: stable@vger.kernel.org
    Cc: Marc Zyngier
    Cc: Oliver Upton
    Reported-by: syzbot+df6fbbd2ee39f21289ef@syzkaller.appspotmail.com
    Signed-off-by: Sean Christopherson
    Reviewed-by: Oliver Upton
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     
  • [ Upstream commit a44a4cc1c969afec97dbb2aedaf6f38eaa6253bb ]

    Unfortunately, there is no guarantee that KVM was able to instantiate a
    debugfs directory for a particular VM. To that end, KVM shouldn't even
    attempt to create new debugfs files in this case. If the specified
    parent dentry is NULL, debugfs_create_file() will instantiate files at
    the root of debugfs.

    For arm64, it is possible to create the vgic-state file outside of a
    VM directory; the file is then not cleaned up when the VM is destroyed.
    Nonetheless, the corresponding struct kvm is freed when the VM is
    destroyed.

    Nip the problem in the bud for all possible errant debugfs file
    creations by initializing kvm->debugfs_dentry to -ENOENT. In so doing,
    debugfs_create_file() will fail instead of creating the file in the root
    directory.

    Cc: stable@kernel.org
    Fixes: 929f45e32499 ("kvm: no need to check return value of debugfs_create functions")
    Signed-off-by: Oliver Upton
    Signed-off-by: Marc Zyngier
    Link: https://lore.kernel.org/r/20220406235615.1447180-2-oupton@google.com
    Signed-off-by: Sasha Levin

    Oliver Upton
     
  • [ Upstream commit 37b2a6510a48ca361ced679f92682b7b7d7d0330 ]

    Allocations whose size is related to the memslot size can be arbitrarily
    large. Do not use kvzalloc/kvcalloc, as those are limited to "not crazy"
    sizes that fit in 32 bits.
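
    For example, the dirty bitmap allocation becomes (diff-style sketch;
    call sites in this stable branch may vary):

        -       memslot->dirty_bitmap = kvzalloc(dirty_bytes, GFP_KERNEL_ACCOUNT);
        +       memslot->dirty_bitmap = __vcalloc(2, dirty_bytes / 2,
        +                                         GFP_KERNEL_ACCOUNT);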

    Cc: stable@vger.kernel.org
    Fixes: 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
    Reviewed-by: David Hildenbrand
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Paolo Bonzini
     

14 Apr, 2022

1 commit

  • commit 5593473a1e6c743764b08e3b6071cb43b5cfa6c4 upstream.

    kvm_vcpu_release() will call kvm_dirty_ring_free(), freeing
    ring->dirty_gfns and setting it to NULL. Afterwards, it calls
    kvm_arch_vcpu_destroy().

    However, if closing the file descriptor races with KVM_RUN in such a way
    that vcpu->arch.st.preempted == 0, the following call stack leads to a
    NULL pointer dereference in kvm_dirty_ring_push():

    mark_page_dirty_in_slot+0x192/0x270 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3171
    kvm_steal_time_set_preempted arch/x86/kvm/x86.c:4600 [inline]
    kvm_arch_vcpu_put+0x34e/0x5b0 arch/x86/kvm/x86.c:4618
    vcpu_put+0x1b/0x70 arch/x86/kvm/../../../virt/kvm/kvm_main.c:211
    vmx_free_vcpu+0xcb/0x130 arch/x86/kvm/vmx/vmx.c:6985
    kvm_arch_vcpu_destroy+0x76/0x290 arch/x86/kvm/x86.c:11219
    kvm_vcpu_destroy arch/x86/kvm/../../../virt/kvm/kvm_main.c:441 [inline]

    The fix is to release the dirty page ring after kvm_arch_vcpu_destroy
    has run.
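
    A simplified sketch of the reordered teardown in kvm_vcpu_destroy():

        static void kvm_vcpu_destroy(struct kvm_vcpu *vcpu)
        {
                kvm_arch_vcpu_destroy(vcpu);

                /* Only free the ring once the arch code can no longer use it. */
                kvm_dirty_ring_free(&vcpu->dirty_ring);
                free_page((unsigned long)vcpu->run);
                kmem_cache_free(kvm_vcpu_cache, vcpu);
        }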

    Reported-by: Qiuhao Li
    Reported-by: Gaoning Pan
    Reported-by: Yongkang Jia
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     

08 Apr, 2022

1 commit

  • commit 5f6de5cbebee925a612856fce6f9182bb3eee0db upstream.

    Tie the lifetime of the KVM module to the lifetime of each VM via
    kvm.users_count. This way anything that grabs a reference to the VM via
    kvm_get_kvm() cannot accidentally outlive the KVM module.

    Prior to this commit, the lifetime of the KVM module was tied to the
    lifetime of /dev/kvm file descriptors, VM file descriptors, and vCPU
    file descriptors by their respective file_operations "owner" field.
    This approach is insufficient because references grabbed via
    kvm_get_kvm() do not prevent closing any of the aforementioned file
    descriptors.

    This fixes a long standing theoretical bug in KVM that at least affects
    async page faults. kvm_setup_async_pf() grabs a reference via
    kvm_get_kvm(), and drops it in an asynchronous work callback. Nothing
    prevents the VM file descriptor from being closed and the KVM module
    from being unloaded before this callback runs.
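
    A simplified sketch of the idea (error labels are illustrative):

        /* kvm_create_vm(): pin the module for as long as the VM exists. */
        if (!try_module_get(kvm_chardev_ops.owner)) {
                r = -ENODEV;
                goto out_err;
        }

        /* kvm_destroy_vm(): runs when kvm.users_count drops to zero. */
        module_put(kvm_chardev_ops.owner);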

    Fixes: af585b921e5d ("KVM: Halt vcpu if page it tries to access is swapped out")
    Fixes: 3d3aab1b973b ("KVM: set owner of cpu and vm file operations")
    Cc: stable@vger.kernel.org
    Suggested-by: Ben Gardon
    [ Based on a patch from Ben implemented for Google's kernel. ]
    Signed-off-by: David Matlack
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    David Matlack
     

16 Mar, 2022

1 commit

  • [ Upstream commit 4cb9a998b1ce25fad74a82f5a5c45a4ef40de337 ]

    I saw the below splat after the host suspended and resumed.

    WARNING: CPU: 0 PID: 2943 at kvm/arch/x86/kvm/../../../virt/kvm/kvm_main.c:5531 kvm_resume+0x2c/0x30 [kvm]
    CPU: 0 PID: 2943 Comm: step_after_susp Tainted: G W IOE 5.17.0-rc3+ #4
    RIP: 0010:kvm_resume+0x2c/0x30 [kvm]
    Call Trace:

    syscore_resume+0x90/0x340
    suspend_devices_and_enter+0xaee/0xe90
    pm_suspend.cold+0x36b/0x3c2
    state_store+0x82/0xf0
    kernfs_fop_write_iter+0x1b6/0x260
    new_sync_write+0x258/0x370
    vfs_write+0x33f/0x510
    ksys_write+0xc9/0x160
    do_syscall_64+0x3b/0xc0
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    lockdep_is_held() can return -1 when lockdep is disabled which triggers
    this warning. Let's use lockdep_assert_not_held() which can detect
    incorrect calls while holding a lock and it also avoids false negatives
    when lockdep is disabled.
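
    The change itself is tiny; a diff-style sketch of kvm_resume():

        -       WARN_ON(lockdep_is_held(&kvm_count_lock));
        +       lockdep_assert_not_held(&kvm_count_lock);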

    Signed-off-by: Wanpeng Li
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Wanpeng Li
     

09 Mar, 2022

1 commit

  • [ Upstream commit 6f390916c4fb359507d9ac4bf1b28a4f8abee5c0 ]

    Wrap s390's halt_poll_max_steal with READ_ONCE and snapshot the result of
    kvm_arch_no_poll() in kvm_vcpu_block() to avoid a mostly-theoretical,
    largely benign bug on s390 where the result of kvm_arch_no_poll() could
    change due to userspace modifying halt_poll_max_steal while the vCPU is
    blocking. The bug is largely benign as it will either cause KVM to skip
    updating halt-polling times (no_poll toggles false=>true) or to update
    halt-polling times with a slightly flawed block_ns.

    Note, READ_ONCE is unnecessary in the current code, add it in case the
    arch hook is ever inlined, and to provide a hint that userspace can
    change the param at will.
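
    A sketch of the s390 hook with the new READ_ONCE (simplified; the s390
    internals shown here follow the code at the time and may differ):

        bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
        {
                /* Don't poll above halt_poll_max_steal percent of steal time. */
                if (S390_lowcore.avg_steal_timer * 100 / (TICK_USEC << 12) >=
                    READ_ONCE(halt_poll_max_steal)) {
                        vcpu->stat.halt_no_poll_steal++;
                        return true;
                }
                return false;
        }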

    Fixes: 8b905d28ee17 ("KVM: s390: provide kvm_arch_no_poll function")
    Reviewed-by: Christian Borntraeger
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Sean Christopherson
     

16 Feb, 2022

1 commit

  • [ Upstream commit 6a0c61703e3a5d67845a4b275e1d9d7bc1b5aad7 ]

    Fix the following false positive warning:
    =============================
    WARNING: suspicious RCU usage
    5.16.0-rc4+ #57 Not tainted
    -----------------------------
    arch/x86/kvm/../../../virt/kvm/eventfd.c:484 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    3 locks held by fc_vcpu 0/330:
    #0: ffff8884835fc0b0 (&vcpu->mutex){+.+.}-{3:3}, at: kvm_vcpu_ioctl+0x88/0x6f0 [kvm]
    #1: ffffc90004c0bb68 (&kvm->srcu){....}-{0:0}, at: vcpu_enter_guest+0x600/0x1860 [kvm]
    #2: ffffc90004c0c1d0 (&kvm->irq_srcu){....}-{0:0}, at: kvm_notify_acked_irq+0x36/0x180 [kvm]

    stack backtrace:
    CPU: 26 PID: 330 Comm: fc_vcpu 0 Not tainted 5.16.0-rc4+
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
    Call Trace:

    dump_stack_lvl+0x44/0x57
    kvm_notify_acked_gsi+0x6b/0x70 [kvm]
    kvm_notify_acked_irq+0x8d/0x180 [kvm]
    kvm_ioapic_update_eoi+0x92/0x240 [kvm]
    kvm_apic_set_eoi_accelerated+0x2a/0xe0 [kvm]
    handle_apic_eoi_induced+0x3d/0x60 [kvm_intel]
    vmx_handle_exit+0x19c/0x6a0 [kvm_intel]
    vcpu_enter_guest+0x66e/0x1860 [kvm]
    kvm_arch_vcpu_ioctl_run+0x438/0x7f0 [kvm]
    kvm_vcpu_ioctl+0x38a/0x6f0 [kvm]
    __x64_sys_ioctl+0x89/0xc0
    do_syscall_64+0x3a/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Since kvm_unregister_irq_ack_notifier() does synchronize_srcu(&kvm->irq_srcu),
    kvm->irq_ack_notifier_list is protected by kvm->irq_srcu. In fact,
    kvm->irq_srcu SRCU read lock is held in kvm_notify_acked_irq(), making it
    a false positive warning. So use hlist_for_each_entry_srcu() instead of
    hlist_for_each_entry_rcu().
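
    The fix, as a diff-style sketch of the loop in kvm_notify_acked_gsi():

        -       hlist_for_each_entry_rcu(kian, &kvm->irq_ack_notifier_list, link)
        +       hlist_for_each_entry_srcu(kian, &kvm->irq_ack_notifier_list, link,
        +                                 srcu_read_lock_held(&kvm->irq_srcu))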

    Reviewed-by: Sean Christopherson
    Signed-off-by: Hou Wenlong
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Hou Wenlong
     

02 Feb, 2022

1 commit

  • commit 31c25585695abdf03d6160aa6d829e855b256329 upstream.

    Revert a completely broken check on an "invalid" RIP in SVM's workaround
    for the DecodeAssists SMAP errata. kvm_vcpu_gfn_to_memslot() obviously
    expects a gfn, i.e. operates in the guest physical address space, whereas
    RIP is a virtual (not even linear) address. The "fix" worked for the
    problematic KVM selftest because the test identity mapped RIP.

    Fully revert the hack instead of trying to translate RIP to a GPA, as the
    non-SEV case is now handled earlier, and KVM cannot access guest page
    tables to translate RIP.

    This reverts commit e72436bc3a5206f95bb384e741154166ddb3202e.

    Fixes: e72436bc3a52 ("KVM: SVM: avoid infinite loop on NPF from bad address")
    Reported-by: Liam Merwick
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Reviewed-by: Liam Merwick
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

22 Dec, 2021

1 commit

  • [ Upstream commit 5f25e71e311478f9bb0a8ef49e7d8b95316491d7 ]

    This is not an unrecoverable situation. Users of kvm_read_guest_offset_cached
    and kvm_write_guest_offset_cached must expect the read/write to fail, and
    therefore it is possible to just return early with an error value.
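
    Diff-style sketch of the change (the same pattern applies in both the
    read and write paths):

        -       BUG_ON(len + offset > ghc->len);
        +       if (WARN_ON_ONCE(len + offset > ghc->len))
        +               return -EINVAL;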

    Signed-off-by: Paolo Bonzini
    Signed-off-by: Sasha Levin

    Paolo Bonzini
     

08 Dec, 2021

2 commits

  • commit bda44d844758c70c8dc1478e6fc9c25efa90c5a7 upstream.

    When modifying memslots, snapshot the "old" memslot and copy it to the
    "new" memslot's arch data after (re)acquiring slots_arch_lock. x86 can
    change a memslot's arch data while memslot updates are in-progress so
    long as it holds slots_arch_lock, thus snapshotting a memslot without
    holding the lock can result in the consumption of stale data.

    Fixes: b10a038e84d1 ("KVM: mmu: Add slots_arch_lock for memslot arch fields")
    Cc: stable@vger.kernel.org
    Cc: Ben Gardon
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 6b285a5587506bae084cf9a3ed5aa491d623b91b upstream.

    Reject userspace memslots whose size exceeds the storage capacity of an
    "unsigned long". KVM's uAPI takes the size as u64 to support large slots
    on 64-bit hosts, but does not account for the size being truncated on
    32-bit hosts in various flows. The access_ok() check on the userspace
    virtual address in particular casts the size to "unsigned long" and will
    check the wrong number of bytes.

    KVM doesn't actually support slots whose size doesn't fit in an "unsigned
    long", e.g. KVM's internal kvm_memory_slot.npages is an "unsigned long",
    not a "u64", and misc arch specific code follows that behavior.

    Fixes: fa3d315a4ce2 ("KVM: Validate userspace_addr of memslot when registered")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Reviewed-by: Maciej S. Szmigiero
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     


22 Sep, 2021

3 commits

  • Check for a NULL cpumask_var_t when kicking multiple vCPUs via
    cpumask_available(), which performs a !NULL check if and only if cpumasks
    are configured to be allocated off-stack. This is a meaningless
    optimization, e.g. avoids a TEST+Jcc and TEST+CMOV on x86, but more
    importantly helps document that the NULL check is necessary even though
    all callers pass in a local variable.

    No functional change intended.
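
    A sketch of the pattern, simplified from kvm_make_vcpus_request_mask():

        /* cpumask_available() compiles to "true" for on-stack cpumasks. */
        if (cpu != -1 && cpu != me && cpumask_available(tmp) &&
            kvm_request_needs_ipi(vcpu, req))
                __cpumask_set_cpu(cpu, tmp);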

    Cc: Lai Jiangshan
    Signed-off-by: Sean Christopherson
    Signed-off-by: Vitaly Kuznetsov
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
    is read exactly once, and by ensuring the vCPU is booted from guest mode
    if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in
    kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
    kvm_request_needs_ipi() returns true.

    Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
    kvm_request_needs_ipi()) means the target vCPU could get migrated (change
    vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpu and
    reading vcpu->mode. If that happens, the kick/IPI will be sent to the
    old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.

    Although failing to kick the vCPU is not exactly ideal, practically
    speaking it cannot cause a functional issue unless there is also a bug in
    the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
    behavior.

    The purpose of sending an IPI is purely to get a vCPU into the host (or
    out of reading SPTEs) so that the vCPU can recognize a change in state,
    e.g. a KVM_REQ_* request. If vCPU's handling of the state change is
    required for correctness, KVM must ensure either the vCPU sees the change
    before entering the guest, or that the sender sees the vCPU as running in
    guest mode. All architectures handle this by (a) sending the request
    before calling kvm_vcpu_kick() and (b) checking for requests _after_
    setting vcpu->mode.

    x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
    ensure it kicks and waits for vCPUs that started reading SPTEs _before_
    MMU changes were finalized, but any vCPU that starts reading after MMU
    changes were finalized will see the new state and can continue on
    uninterrupted.

    For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
    x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
    upon for functional correctness, e.g. in the dirty log case, userspace
    cannot assume it has a 100% complete log if vCPUs are still running.

    All that said, eliminate the benign race since the cost of doing so is an
    "extra" atomic cmpxchg() in the case where the target vCPU is loaded by
    the current pCPU or is not loaded at all. I.e. the kick will be skipped
    due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
    opposed to the kick being skipped because of the cpu checks.

    Keep the "cpu != me" checks even though they appear useless/impossible at
    first glance. x86 processes guest IPI writes in a fast path that runs in
    IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And
    calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST_MODE
    or READING_SHADOW_PAGE_TABLES is perfectly reasonable.

    Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
    persists, e.g. the vCPU could exit guest mode and get offlined between
    the cpu_online() check and the sending of smp_send_reschedule(). But,
    the online check appears to exist only to avoid a WARN in x86's
    native_smp_send_reschedule() that fires if the target CPU is not online.
    The reschedule WARN exists because CPU offlining takes the CPU out of the
    scheduling pool, i.e. the WARN is intended to detect the case where the
    kernel attempts to schedule a task on an offline CPU. The actual sending
    of the IPI is a non-issue as at worst it will simply be dropped on the
    floor. In other words, KVM's usurping of the reschedule IPI could
    theoretically trigger a WARN if the stars align, but there will be no
    loss of functionality.

    [*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a
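
    A simplified sketch of the resulting kvm_vcpu_kick() flow:

        me = get_cpu();

        /* Ordering: check/transition vcpu->mode before sampling vcpu->cpu. */
        if (kvm_arch_vcpu_should_kick(vcpu)) {
                cpu = READ_ONCE(vcpu->cpu);
                if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
                        smp_send_reschedule(cpu);
        }
        put_cpu();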

    Cc: Venkatesh Srinivas
    Cc: Vitaly Kuznetsov
    Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Vitaly Kuznetsov
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • grow_halt_poll_ns() ignores values between 0 and halt_poll_ns_grow_start
    (10000 by default). However, when we shrink halt_poll_ns we may fall way
    below halt_poll_ns_grow_start and end up with halt_poll_ns values that
    don't make a lot of sense: like 1 or 9, or 19.

    VCPU1 trace (halt_poll_ns_shrink equals 2):

    VCPU1 grow 10000
    VCPU1 shrink 5000
    VCPU1 shrink 2500
    VCPU1 shrink 1250
    VCPU1 shrink 625
    VCPU1 shrink 312
    VCPU1 shrink 156
    VCPU1 shrink 78
    VCPU1 shrink 39
    VCPU1 shrink 19
    VCPU1 shrink 9
    VCPU1 shrink 4

    Mirror what grow_halt_poll_ns() does and set halt_poll_ns to 0 as soon
    as the shrunken halt_poll_ns value falls below halt_poll_ns_grow_start.
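
    A sketch of the resulting shrink_halt_poll_ns() (simplified):

        static void shrink_halt_poll_ns(struct kvm_vcpu *vcpu)
        {
                unsigned int old, val, shrink, grow_start;

                old = val = vcpu->halt_poll_ns;
                shrink = READ_ONCE(halt_poll_ns_shrink);
                grow_start = READ_ONCE(halt_poll_ns_grow_start);
                if (shrink == 0)
                        val = 0;
                else
                        val /= shrink;

                /* Mirror grow_halt_poll_ns(): snap to 0 below grow_start. */
                if (val < grow_start)
                        val = 0;

                vcpu->halt_poll_ns = val;
                trace_kvm_halt_poll_ns_shrink(vcpu->vcpu_id, val, old);
        }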

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Paolo Bonzini
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sergey Senozhatsky
     

06 Sep, 2021

4 commits

  • Drop the unused function, as reported by the kernel test robot.

    Reported-by: kernel test robot
    Signed-off-by: Peter Xu
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Peter Xu
     
  • KVM/arm64 updates for 5.15

    - Page ownership tracking between host EL1 and EL2

    - Rely on userspace page tables to create large stage-2 mappings

    - Fix incompatibility between pKVM and kmemleak

    - Fix the PMU reset state, and improve the performance of the virtual PMU

    - Move over to the generic KVM entry code

    - Address PSCI reset issues w.r.t. save/restore

    - Preliminary rework for the upcoming pKVM fixed feature

    - A bunch of MM cleanups

    - A vGIC fix for timer spurious interrupts

    - Various cleanups

    Paolo Bonzini
     
  • Add a new stat that counts the number of times a remote TLB flush is
    requested, regardless of whether it kicks vCPUs out of guest mode. This
    allows us to look at how often flushes are initiated.

    Unlike remote_tlb_flush, this one applies to ARM's instruction-set-based
    TLB flush implementation, so apply it there too.

    Original-by: David Matlack
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     
  • Don't export KVM's MMU notifier count helpers, under no circumstance
    should any downstream module, including x86's vendor code, have a
    legitimate reason to piggyback KVM's MMU notifier logic. E.g. in the x86
    case, only KVM's MMU should be elevating the notifier count, and that
    code is always built into the core kvm.ko module.

    Fixes: edb298c663fc ("KVM: x86/mmu: bump mmu notifier count in kvm_zap_gfn_range")
    Cc: Maxim Levitsky
    Signed-off-by: Sean Christopherson
    Reviewed-by: Maxim Levitsky
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

21 Aug, 2021

3 commits

  • Add three log histogram stats to record the distribution of time spent
    on successful polling, failed polling and VCPU wait.
    halt_poll_success_hist: Distribution of spent time for a successful poll.
    halt_poll_fail_hist: Distribution of spent time for a failed poll.
    halt_wait_hist: Distribution of time a VCPU has spent on waiting.

    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     
  • Add simple stats halt_wait_ns to record the time a VCPU has spent on
    waiting for all architectures (not just powerpc).

    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     
  • This, together with the previous patch, ensures that kvm_zap_gfn_range()
    doesn't race with a page fault running on another vCPU, and will make the
    page fault code retry instead.

    This is based on a patch suggested by Sean Christopherson:
    https://lkml.org/lkml/2021/7/22/1025

    Suggested-by: Sean Christopherson
    Signed-off-by: Maxim Levitsky
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Maxim Levitsky
     

13 Aug, 2021

2 commits

  • Allow archs to create arch-specific nodes under the kvm->debugfs_dentry
    directory besides the stats fields. The new interface
    kvm_arch_create_vm_debugfs() is defined but not yet used. It's called
    after kvm->debugfs_dentry is created, so it can be referenced directly in
    kvm_arch_create_vm_debugfs(). Archs should define their own versions when
    they want to create extra debugfs nodes.
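
    The default implementation is a weak no-op that archs can override
    (sketch):

        int __weak kvm_arch_create_vm_debugfs(struct kvm *kvm)
        {
                return 0;
        }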

    Signed-off-by: Peter Xu
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Peter Xu
     
  • These stores are copied and pasted from the "if" statements above.
    They are dead and while they are not really a bug, they can be
    confusing to anyone reading the code as well. Remove them.

    Reported-by: kernel test robot
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     


06 Aug, 2021

2 commits

  • The memslot for a given gfn is looked up multiple times during page
    fault handling. Avoid binary searching for it multiple times by caching
    the most recently used slot. There is an existing VM-wide last_used_slot
    but that does not work well for cases where vCPUs are accessing memory
    in different slots (see performance data below).

    Another benefit of caching the most recently used slot (versus looking
    up the slot once and passing around a pointer) is speeding up memslot
    lookups *across* faults and during spte prefetching.

    To measure the performance of this change I ran dirty_log_perf_test with
    64 vCPUs and 64 memslots and measured "Populate memory time" and
    "Iteration 2 dirty memory time". Tests were ran with eptad=N to force
    dirty logging to use fast_page_fault so its performance could be
    measured.

    Config    | Metric                        | Before | After
    --------- | ----------------------------- | ------ | ------
    tdp_mmu=Y | Populate memory time          | 6.76s  | 5.47s
    tdp_mmu=Y | Iteration 2 dirty memory time | 2.83s  | 0.31s
    tdp_mmu=N | Populate memory time          | 20.4s  | 18.7s
    tdp_mmu=N | Iteration 2 dirty memory time | 2.65s  | 0.30s

    The "Iteration 2 dirty memory time" results are especially compelling
    because they are equivalent to running the same test with a single
    memslot. In other words, fast_page_fault performance no longer scales
    with the number of memslots.
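
    A simplified sketch of the lookup with the per-vCPU cache (helper names
    follow the upstream patch):

        struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu,
                                                        gfn_t gfn)
        {
                struct kvm_memslots *slots = kvm_vcpu_memslots(vcpu);
                struct kvm_memory_slot *slot;
                int slot_index = vcpu->last_used_slot;

                /* Fast path: recheck the slot this vCPU used last time. */
                slot = try_get_memslot(slots, slot_index, gfn);
                if (slot)
                        return slot;

                /* Slow path: binary search, then refresh the cache. */
                slot = search_memslots(slots, gfn, &slot_index);
                if (slot) {
                        vcpu->last_used_slot = slot_index;
                        return slot;
                }
                return NULL;
        }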

    Signed-off-by: David Matlack
    Message-Id:
    Signed-off-by: Paolo Bonzini

    David Matlack
     
  • lru_slot is used to keep track of the index of the most-recently used
    memslot. The correct acronym would be "mru" but that is not a common
    acronym. So call it last_used_slot which is a bit more obvious.

    Suggested-by: Paolo Bonzini
    Signed-off-by: David Matlack
    Message-Id:
    Signed-off-by: Paolo Bonzini

    David Matlack
     

04 Aug, 2021

1 commit

  • KVM creates a debugfs directory for each VM in order to store statistics
    about the virtual machine. The directory name is built from the process
    pid and a VM fd. While generally unique, it is possible to keep a
    file descriptor alive in a way that causes duplicate directories, which
    manifests as these messages:

    [ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!

    Even though this should not happen in practice, it is more or less
    expected in the case of KVM for testcases that call KVM_CREATE_VM and
    close the resulting file descriptor repeatedly and in parallel.

    When this happens, debugfs_create_dir() returns an error but
    kvm_create_vm_debugfs() goes on to allocate stat data structs which are
    later leaked. The slow memory leak was spotted by syzkaller, where it
    caused OOM reports.

    Since the issue only affects debugfs, do a lookup before calling
    debugfs_create_dir, so that the message is downgraded and rate-limited.
    While at it, ensure kvm->debugfs_dentry is NULL rather than an error
    if it is not created. This fixes kvm_destroy_vm_debugfs, which was not
    checking IS_ERR_OR_NULL correctly.
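
    A sketch of the resulting logic in kvm_create_vm_debugfs():

        dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
        if (dent) {
                pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n",
                                    dir_name);
                dput(dent);
                return 0;
        }
        dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);
        if (IS_ERR(dent))
                return 0;

        kvm->debugfs_dentry = dent;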

    Cc: stable@vger.kernel.org
    Fixes: 536a6f88c49d ("KVM: Create debugfs dir and stat files for each VM")
    Reported-by: Alexey Kardashevskiy
    Suggested-by: Greg Kroah-Hartman
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

03 Aug, 2021

2 commits

  • Avoid taking mmu_lock for .invalidate_range_{start,end}() notifications
    that are unrelated to KVM. This is possible now that memslot updates are
    blocked from range_start() to range_end(); that ensures that lock elision
    happens in both or none, and therefore that mmu_notifier_count updates
    (which must occur while holding mmu_lock for write) are always paired
    across start->end.

    Based on patches originally written by Ben Gardon.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • We would like to avoid taking mmu_lock for .invalidate_range_{start,end}()
    notifications that are unrelated to KVM. Because mmu_notifier_count
    must be modified while holding mmu_lock for write, and must always
    be paired across start->end to stay balanced, lock elision must
    happen in both or none. Therefore, in preparation for this change,
    this patch prevents memslot updates across range_start() and range_end().

    Note, technically flag-only memslot updates could be allowed in parallel,
    but stalling a memslot update for a relatively short amount of time is
    not a scalability issue, and this is all more than complex enough.

    A long note on the locking: a previous version of the patch used an rwsem
    to block the memslot update while the MMU notifier ran, but this resulted
    in the following deadlock involving the pseudo-lock tagged as
    "mmu_notifier_invalidate_range_start".

    ======================================================
    WARNING: possible circular locking dependency detected
    5.12.0-rc3+ #6 Tainted: G OE
    ------------------------------------------------------
    qemu-system-x86/3069 is trying to acquire lock:
    ffffffff9c775ca0 (mmu_notifier_invalidate_range_start){+.+.}-{0:0}, at: __mmu_notifier_invalidate_range_end+0x5/0x190

    but task is already holding lock:
    ffffaff7410a9160 (&kvm->mmu_notifier_slots_lock){.+.+}-{3:3}, at: kvm_mmu_notifier_invalidate_range_start+0x36d/0x4f0 [kvm]

    which lock already depends on the new lock.

    This corresponds to the following MMU notifier logic:

    invalidate_range_start
        take pseudo lock
        down_read()           (*)
        release pseudo lock
    invalidate_range_end
        take pseudo lock      (**)
        up_read()
        release pseudo lock

    At point (*) we take the mmu_notifier_slots_lock inside the pseudo lock;
    at point (**) we take the pseudo lock inside the mmu_notifier_slots_lock.

    This could cause a deadlock (ignoring for a second that the pseudo lock
    is not a lock):

    - invalidate_range_start waits on down_read(), because the rwsem is
    held by install_new_memslots

    - install_new_memslots waits on down_write(), because the rwsem is
    held till (another) invalidate_range_end finishes

    - invalidate_range_end waits on the pseudo lock, held by
    invalidate_range_start.

    Removing the fairness of the rwsem breaks the cycle (in lockdep terms,
    it would change the *shared* rwsem readers into *shared recursive*
    readers), so open-code the wait using a readers count and a
    spinlock. This also allows handling blockable and non-blockable
    critical section in the same way.

    Losing the rwsem fairness does theoretically allow MMU notifiers to
    block install_new_memslots forever. Note that mm/mmu_notifier.c's own
    retry scheme in mmu_interval_read_begin also uses wait/wake_up
    and is likewise not fair.
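
    A simplified sketch of the open-coded wait (field names follow the
    upstream patch):

        /* Notifier side, in invalidate_range_start(): */
        spin_lock(&kvm->mn_invalidate_lock);
        kvm->mn_active_invalidate_count++;
        spin_unlock(&kvm->mn_invalidate_lock);

        /* Memslot update side, in install_new_memslots(): */
        spin_lock(&kvm->mn_invalidate_lock);
        prepare_to_rcuwait(&kvm->mn_memslots_update_rcuwait);
        while (kvm->mn_active_invalidate_count) {
                set_current_state(TASK_UNINTERRUPTIBLE);
                spin_unlock(&kvm->mn_invalidate_lock);
                schedule();
                spin_lock(&kvm->mn_invalidate_lock);
        }
        finish_rcuwait(&kvm->mn_memslots_update_rcuwait);
        spin_unlock(&kvm->mn_invalidate_lock);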

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     


28 Jul, 2021

1 commit

  • The arguments to the KVM_CLEAR_DIRTY_LOG ioctl include a pointer,
    therefore it needs a compat ioctl implementation. Otherwise,
    32-bit userspace fails to invoke it on 64-bit kernels; for x86
    it might work fine by chance if the padding is zero, but not
    on big-endian architectures.
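
    The compat path needs its own layout because the bitmap pointer is only
    32 bits wide for compat tasks (sketch, per the upstream commit):

        struct compat_kvm_clear_dirty_log {
                __u32 slot;
                __u32 num_pages;
                __u64 first_page;
                union {
                        compat_uptr_t dirty_bitmap; /* one bit per page */
                        __u64 padding2;
                };
        };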

    Reported-by: Thomas Sattler
    Cc: stable@vger.kernel.org
    Fixes: 2a31b9db1535 ("kvm: introduce manual dirty log reprotect")
    Reviewed-by: Peter Xu
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini