04 Aug, 2021

1 commit

  • KVM creates a debugfs directory for each VM in order to store statistics
    about the virtual machine. The directory name is built from the process
    pid and a VM fd. While generally unique, it is possible to keep a
    file descriptor alive in a way that causes duplicate directories, which
    manifests as these messages:

    [ 471.846235] debugfs: Directory '20245-4' with parent 'kvm' already present!

    Even though this should not happen in practice, it is more or less
    expected in the case of KVM for testcases that call KVM_CREATE_VM and
    close the resulting file descriptor repeatedly and in parallel.

    When this happens, debugfs_create_dir() returns an error but
    kvm_create_vm_debugfs() goes on to allocate stat data structs which are
    later leaked. The slow memory leak was spotted by syzkaller, where it
    caused OOM reports.

    Since the issue only affects debugfs, do a lookup before calling
    debugfs_create_dir, so that the message is downgraded and rate-limited.
    While at it, ensure kvm->debugfs_dentry is NULL rather than an error
    if it is not created. This fixes kvm_destroy_vm_debugfs, which was not
    checking IS_ERR_OR_NULL correctly.

    Cc: stable@vger.kernel.org
    Fixes: 536a6f88c49d ("KVM: Create debugfs dir and stat files for each VM")
    Reported-by: Alexey Kardashevskiy
    Suggested-by: Greg Kroah-Hartman
    Acked-by: Greg Kroah-Hartman
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
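
    A minimal sketch of the lookup-before-create approach described in the
    entry above, assuming kvm_main.c's naming (kvm_debugfs_dir, ITOA_MAX_LEN)
    and the standard debugfs_lookup()/debugfs_create_dir() API; the exact
    messages and error paths in the real patch may differ:

        static int kvm_create_vm_debugfs(struct kvm *kvm, int fd)
        {
            char dir_name[ITOA_MAX_LEN * 2];
            struct dentry *dent;

            snprintf(dir_name, sizeof(dir_name), "%d-%d",
                     task_pid_nr(current), fd);

            /*
             * A stale fd can keep an old VM (and its directory) alive, so
             * probe for a duplicate first instead of letting
             * debugfs_create_dir() print its non-ratelimited error.
             */
            dent = debugfs_lookup(dir_name, kvm_debugfs_dir);
            if (dent) {
                pr_warn_ratelimited("KVM: debugfs: duplicate directory %s\n",
                                    dir_name);
                dput(dent);
                return 0;   /* kvm->debugfs_dentry stays NULL */
            }

            dent = debugfs_create_dir(dir_name, kvm_debugfs_dir);
            if (IS_ERR(dent))
                return 0;   /* keep NULL rather than an ERR_PTR */
            kvm->debugfs_dentry = dent;

            /* ... allocate stat data only once the directory exists ... */
            return 0;
        }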
     

28 Jul, 2021

2 commits

  • The arguments to the KVM_CLEAR_DIRTY_LOG ioctl include a pointer,
    therefore it needs a compat ioctl implementation. Otherwise,
    32-bit userspace fails to invoke it on 64-bit kernels; for x86
    it might work fine by chance if the padding is zero, but not
    on big-endian architectures.

    Reported-by: Thomas Sattler
    Cc: stable@vger.kernel.org
    Fixes: 2a31b9db1535 ("kvm: introduce manual dirty log reprotect")
    Reviewed-by: Peter Xu
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
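
    A rough sketch of what the compat path needs to do, using the usual
    compat_uptr_t/compat_ptr() conversions; the helper name is hypothetical,
    and the real patch wires this into the VM fd's compat ioctl handler:

        /* 32-bit userspace layout: the bitmap pointer is 32 bits wide. */
        struct compat_kvm_clear_dirty_log {
            __u32 slot;
            __u32 num_pages;
            __u64 first_page;
            union {
                compat_uptr_t dirty_bitmap;
                __u64 padding2;
            };
        };

        /* Hypothetical helper invoked from kvm_vm_compat_ioctl(). */
        static long kvm_vm_compat_clear_dirty_log(struct kvm *kvm,
                                                  void __user *argp)
        {
            struct compat_kvm_clear_dirty_log compat_log;
            struct kvm_clear_dirty_log log;

            if (copy_from_user(&compat_log, argp, sizeof(compat_log)))
                return -EFAULT;

            log.slot         = compat_log.slot;
            log.num_pages    = compat_log.num_pages;
            log.first_page   = compat_log.first_page;
            log.dirty_bitmap = compat_ptr(compat_log.dirty_bitmap);

            return kvm_vm_ioctl_clear_dirty_log(kvm, &log);
        }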
     
  • SMT siblings share caches and other hardware resources, so busy halt
    polling on one thread degrades the performance of its sibling when the
    sibling is doing useful work.

    Sean Christopherson suggested the following:

    "Rather than disallowing halt-polling entirely, on x86 it should be
    sufficient to simply have the hardware thread yield to its sibling(s)
    via PAUSE. It probably won't get back all performance, but I would
    expect it to be close.
    This compiles on all KVM architectures, and AFAICT the intended usage
    of cpu_relax() is identical for all architectures."

    Suggested-by: Sean Christopherson
    Signed-off-by: Li RongQing
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Li RongQing
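
    A sketch of where the yield lands in the generic halt-polling loop,
    excerpted and simplified from kvm_vcpu_block(); the surrounding stat
    bookkeeping is omitted:

        ktime_t cur, stop = ktime_add_ns(ktime_get(), vcpu->halt_poll_ns);

        do {
            if (kvm_vcpu_check_block(vcpu) < 0)
                break;      /* wake condition: stop polling */
            /*
             * Yield the hardware thread to its SMT sibling (PAUSE on
             * x86) instead of spinning at full speed.
             */
            cpu_relax();
            cur = ktime_get();
        } while (single_task_running() && !need_resched() &&
                 ktime_before(cur, stop));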
     

16 Jul, 2021

1 commit

  • Pull kvm fixes from Paolo Bonzini:

    - Allow again loading KVM on 32-bit non-PAE builds

    - Fixes for host SMIs on AMD

    - Fixes for guest SMIs on AMD

    - Fixes for selftests on s390 and ARM

    - Fix memory leak

    - Enforce no-instrumentation area on vmentry when hardware breakpoints
    are in use.

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (25 commits)
    KVM: selftests: smm_test: Test SMM enter from L2
    KVM: nSVM: Restore nested control upon leaving SMM
    KVM: nSVM: Fix L1 state corruption upon return from SMM
    KVM: nSVM: Introduce svm_copy_vmrun_state()
    KVM: nSVM: Check that VM_HSAVE_PA MSR was set before VMRUN
    KVM: nSVM: Check the value written to MSR_VM_HSAVE_PA
    KVM: SVM: Fix sev_pin_memory() error checks in SEV migration utilities
    KVM: SVM: Return -EFAULT if copy_to_user() for SEV mig packet header fails
    KVM: SVM: add module param to control the #SMI interception
    KVM: SVM: remove INIT intercept handler
    KVM: SVM: #SMI interception must not skip the instruction
    KVM: VMX: Remove vmx_msr_index from vmx.h
    KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()
    KVM: selftests: Address extra memslot parameters in vm_vaddr_alloc
    kvm: debugfs: fix memory leak in kvm_create_vm_debugfs
    KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on the SVM
    KVM: mmio: Fix use-after-free Read in kvm_vm_ioctl_unregister_coalesced_mmio
    KVM: SVM: Revert clearing of C-bit on GPA in #NPF handler
    KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAs
    KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDR
    ...

    Linus Torvalds
     

15 Jul, 2021

3 commits

  • In commit bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors")
    the loop for filling debugfs_stat_data was copy-pasted twice, but in the
    second loop the pointers are stored over the pointers allocated in the
    first loop. All this causes is a memory leak; fix it (see the sketch
    after this entry).

    Fixes: bc9e9e672df9 ("KVM: debugfs: Reuse binary stats descriptors")
    Signed-off-by: Pavel Skripkin
    Reviewed-by: Jing Zhang
    Message-Id:
    Reviewed-by: Jing Zhang
    Signed-off-by: Paolo Bonzini

    Pavel Skripkin
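
    A simplified sketch of the fixed indexing, with the per-descriptor setup
    and debugfs file creation elided; the point is only that the second loop
    must land past the slots used by the first:

        int i;

        for (i = 0; i < kvm_vm_stats_header.num_desc; i++) {
            struct kvm_stat_data *stat_data =
                kzalloc(sizeof(*stat_data), GFP_KERNEL_ACCOUNT);

            if (!stat_data)
                return -ENOMEM;
            kvm->debugfs_stat_data[i] = stat_data;
        }

        for (i = 0; i < kvm_vcpu_stats_header.num_desc; i++) {
            struct kvm_stat_data *stat_data =
                kzalloc(sizeof(*stat_data), GFP_KERNEL_ACCOUNT);

            if (!stat_data)
                return -ENOMEM;
            /*
             * The copy-pasted loop stored this at [i], overwriting (and
             * leaking) the pointers from the first loop.
             */
            kvm->debugfs_stat_data[i + kvm_vm_stats_header.num_desc] =
                stat_data;
        }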
     
  • BUG: KASAN: use-after-free in kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
    Read of size 8 at addr ffff0000c03a2500 by task syz-executor083/4269

    CPU: 5 PID: 4269 Comm: syz-executor083 Not tainted 5.10.0 #7
    Hardware name: linux,dummy-virt (DT)
    Call trace:
    dump_backtrace+0x0/0x2d0 arch/arm64/kernel/stacktrace.c:132
    show_stack+0x28/0x34 arch/arm64/kernel/stacktrace.c:196
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x110/0x164 lib/dump_stack.c:118
    print_address_description+0x78/0x5c8 mm/kasan/report.c:385
    __kasan_report mm/kasan/report.c:545 [inline]
    kasan_report+0x148/0x1e4 mm/kasan/report.c:562
    check_memory_region_inline mm/kasan/generic.c:183 [inline]
    __asan_load8+0xb4/0xbc mm/kasan/generic.c:252
    kvm_vm_ioctl_unregister_coalesced_mmio+0x7c/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:183
    kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
    vfs_ioctl fs/ioctl.c:48 [inline]
    __do_sys_ioctl fs/ioctl.c:753 [inline]
    __se_sys_ioctl fs/ioctl.c:739 [inline]
    __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
    __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
    el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
    do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
    el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
    el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
    el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

    Allocated by task 4269:
    stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
    kasan_save_stack mm/kasan/common.c:48 [inline]
    kasan_set_track mm/kasan/common.c:56 [inline]
    __kasan_kmalloc+0xdc/0x120 mm/kasan/common.c:461
    kasan_kmalloc+0xc/0x14 mm/kasan/common.c:475
    kmem_cache_alloc_trace include/linux/slab.h:450 [inline]
    kmalloc include/linux/slab.h:552 [inline]
    kzalloc include/linux/slab.h:664 [inline]
    kvm_vm_ioctl_register_coalesced_mmio+0x78/0x1cc arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:146
    kvm_vm_ioctl+0x7e8/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3746
    vfs_ioctl fs/ioctl.c:48 [inline]
    __do_sys_ioctl fs/ioctl.c:753 [inline]
    __se_sys_ioctl fs/ioctl.c:739 [inline]
    __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
    __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
    el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
    do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
    el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
    el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
    el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

    Freed by task 4269:
    stack_trace_save+0x80/0xb8 kernel/stacktrace.c:121
    kasan_save_stack mm/kasan/common.c:48 [inline]
    kasan_set_track+0x38/0x6c mm/kasan/common.c:56
    kasan_set_free_info+0x20/0x40 mm/kasan/generic.c:355
    __kasan_slab_free+0x124/0x150 mm/kasan/common.c:422
    kasan_slab_free+0x10/0x1c mm/kasan/common.c:431
    slab_free_hook mm/slub.c:1544 [inline]
    slab_free_freelist_hook mm/slub.c:1577 [inline]
    slab_free mm/slub.c:3142 [inline]
    kfree+0x104/0x38c mm/slub.c:4124
    coalesced_mmio_destructor+0x94/0xa4 arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:102
    kvm_iodevice_destructor include/kvm/iodev.h:61 [inline]
    kvm_io_bus_unregister_dev+0x248/0x280 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:4374
    kvm_vm_ioctl_unregister_coalesced_mmio+0x158/0x1ec arch/arm64/kvm/../../../virt/kvm/coalesced_mmio.c:186
    kvm_vm_ioctl+0xe30/0x14c4 arch/arm64/kvm/../../../virt/kvm/kvm_main.c:3755
    vfs_ioctl fs/ioctl.c:48 [inline]
    __do_sys_ioctl fs/ioctl.c:753 [inline]
    __se_sys_ioctl fs/ioctl.c:739 [inline]
    __arm64_sys_ioctl+0xf88/0x131c fs/ioctl.c:739
    __invoke_syscall arch/arm64/kernel/syscall.c:36 [inline]
    invoke_syscall arch/arm64/kernel/syscall.c:48 [inline]
    el0_svc_common arch/arm64/kernel/syscall.c:158 [inline]
    do_el0_svc+0x120/0x290 arch/arm64/kernel/syscall.c:220
    el0_svc+0x1c/0x28 arch/arm64/kernel/entry-common.c:367
    el0_sync_handler+0x98/0x170 arch/arm64/kernel/entry-common.c:383
    el0_sync+0x140/0x180 arch/arm64/kernel/entry.S:670

    If kvm_io_bus_unregister_dev() returns -ENOMEM, kvm_iodevice_destructor()
    has already been called inside that function to delete
    'struct kvm_coalesced_mmio_dev *dev' from the list and free the device;
    calling kvm_iodevice_destructor() again leads to the use-after-free shown
    above.

    Check the return value of kvm_io_bus_unregister_dev() and only call
    kvm_iodevice_destructor() if the return value is 0 (see the sketch after
    this entry).

    Cc: Paolo Bonzini
    Cc: kvm@vger.kernel.org
    Reported-by: Hulk Robot
    Signed-off-by: Kefeng Wang
    Message-Id:
    Cc: stable@vger.kernel.org
    Fixes: 5d3c4c79384a ("KVM: Stop looking for coalesced MMIO zones if the bus is destroyed", 2021-04-20)
    Signed-off-by: Paolo Bonzini

    Kefeng Wang
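
    A sketch of the fixed unregister path; it is simplified, and the helper
    and field names follow virt/kvm/coalesced_mmio.c but should be treated as
    illustrative:

        int kvm_vm_ioctl_unregister_coalesced_mmio(struct kvm *kvm,
                                struct kvm_coalesced_mmio_zone *zone)
        {
            struct kvm_coalesced_mmio_dev *dev, *tmp;
            int r;

            mutex_lock(&kvm->slots_lock);

            list_for_each_entry_safe(dev, tmp, &kvm->coalesced_zones, list) {
                if (zone->pio == dev->zone.pio &&
                    coalesced_mmio_in_range(dev, zone->addr, zone->size)) {
                    r = kvm_io_bus_unregister_dev(kvm,
                            zone->pio ? KVM_PIO_BUS : KVM_MMIO_BUS,
                            &dev->dev);
                    /*
                     * On failure the bus and every device on it, including
                     * this one, were already destroyed; freeing the device
                     * again is the use-after-free reported above.
                     */
                    if (r)
                        break;
                    kvm_iodevice_destructor(&dev->dev);
                }
            }

            mutex_unlock(&kvm->slots_lock);
            return 0;
        }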
     
  • …git/kvms390/linux into HEAD

    KVM: selftests: Fixes

    - provide memory model for IBM z196 and zEC12
    - do not require 64GB of memory

    Paolo Bonzini
     

30 Jun, 2021

2 commits

  • Merge misc updates from Andrew Morton:
    "191 patches.

    Subsystems affected by this patch series: kthread, ia64, scripts,
    ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
    slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
    mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
    pagealloc, and memory-failure)"

    * emailed patches from Andrew Morton : (191 commits)
    mm,hwpoison: make get_hwpoison_page() call get_any_page()
    mm,hwpoison: send SIGBUS with error virutal address
    mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
    mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
    mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
    mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
    docs: remove description of DISCONTIGMEM
    arch, mm: remove stale mentions of DISCONIGMEM
    mm: remove CONFIG_DISCONTIGMEM
    m68k: remove support for DISCONTIGMEM
    arc: remove support for DISCONTIGMEM
    arc: update comment about HIGHMEM implementation
    alpha: remove DISCONTIGMEM and NUMA
    mm/page_alloc: move free_the_page
    mm/page_alloc: fix counting of managed_pages
    mm/page_alloc: improve memmap_pages dbg msg
    mm: drop SECTION_SHIFT in code comments
    mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
    mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
    mm/page_alloc: scale the number of pages that are batch freed
    ...

    Linus Torvalds
     
  • vma_lookup() finds the vma of a specific address with a cleaner interface
    and is more readable.

    Link: https://lkml.kernel.org/r/20210521174745.2219620-11-Liam.Howlett@Oracle.com
    Signed-off-by: Liam R. Howlett
    Reviewed-by: Laurent Dufour
    Acked-by: David Hildenbrand
    Acked-by: Davidlohr Bueso
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liam Howlett
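
    For illustration, the typical before/after shape of such a conversion
    (hva is a hypothetical host virtual address the caller cares about):

        struct vm_area_struct *vma;

        /* Before: find_vma() returns the first VMA that ends above the
         * address, so callers must also check vm_start themselves. */
        vma = find_vma(current->mm, hva);
        if (!vma || vma->vm_start > hva)
            return -EFAULT;

        /* After: vma_lookup() only returns a VMA that contains hva. */
        vma = vma_lookup(current->mm, hva);
        if (!vma)
            return -EFAULT;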
     

29 Jun, 2021

2 commits

  • Pull kvm updates from Paolo Bonzini:
    "This covers all architectures (except MIPS) so I don't expect any
    other feature pull requests this merge window.

    ARM:

    - Add MTE support in guests, complete with tag save/restore interface

    - Reduce the impact of CMOs by moving them in the page-table code

    - Allow device block mappings at stage-2

    - Reduce the footprint of the vmemmap in protected mode

    - Support the vGIC on dumb systems such as the Apple M1

    - Add selftest infrastructure to support multiple configuration and
    apply that to PMU/non-PMU setups

    - Add selftests for the debug architecture

    - The usual crop of PMU fixes

    PPC:

    - Support for the H_RPT_INVALIDATE hypercall

    - Conversion of Book3S entry/exit to C

    - Bug fixes

    S390:

    - new HW facilities for guests

    - make inline assembly more robust with KASAN and co

    x86:

    - Allow userspace to handle emulation errors (unknown instructions)

    - Lazy allocation of the rmap (host physical -> guest physical
    address)

    - Support for virtualizing TSC scaling on VMX machines

    - Optimizations to avoid shattering huge pages at the beginning of
    live migration

    - Support for initializing the PDPTRs without loading them from
    memory

    - Many TLB flushing cleanups

    - Refuse to load if two-stage paging is available but NX is not (this
    has been a requirement in practice for over a year)

    - A large series that separates the MMU mode (WP/SMAP/SMEP etc.) from
    CR0/CR4/EFER, using the MMU mode everywhere once it is computed
    from the CPU registers

    - Use PM notifier to notify the guest about host suspend or hibernate

    - Support for passing arguments to Hyper-V hypercalls using XMM
    registers

    - Support for Hyper-V TLB flush hypercalls and enlightened MSR bitmap
    on AMD processors

    - Hide Hyper-V hypercalls that are not included in the guest CPUID

    - Fixes for live migration of virtual machines that use the Hyper-V
    "enlightened VMCS" optimization of nested virtualization

    - Bugfixes (not many)

    Generic:

    - Support for retrieving statistics without debugfs

    - Cleanups for the KVM selftests API"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (314 commits)
    KVM: x86: rename apic_access_page_done to apic_access_memslot_enabled
    kvm: x86: disable the narrow guest module parameter on unload
    selftests: kvm: Allows userspace to handle emulation errors.
    kvm: x86: Allow userspace to handle emulation errors
    KVM: x86/mmu: Let guest use GBPAGES if supported in hardware and TDP is on
    KVM: x86/mmu: Get CR4.SMEP from MMU, not vCPU, in shadow page fault
    KVM: x86/mmu: Get CR0.WP from MMU, not vCPU, in shadow page fault
    KVM: x86/mmu: Drop redundant rsvd bits reset for nested NPT
    KVM: x86/mmu: Optimize and clean up so called "last nonleaf level" logic
    KVM: x86: Enhance comments for MMU roles and nested transition trickiness
    KVM: x86/mmu: WARN on any reserved SPTE value when making a valid SPTE
    KVM: x86/mmu: Add helpers to do full reserved SPTE checks w/ generic MMU
    KVM: x86/mmu: Use MMU's role to determine PTTYPE
    KVM: x86/mmu: Collapse 32-bit PAE and 64-bit statements for helpers
    KVM: x86/mmu: Add a helper to calculate root from role_regs
    KVM: x86/mmu: Add helper to update paging metadata
    KVM: x86/mmu: Don't update nested guest's paging bitmasks if CR0.PG=0
    KVM: x86/mmu: Consolidate reset_rsvds_bits_mask() calls
    KVM: x86/mmu: Use MMU role_regs to get LA57, and drop vCPU LA57 helper
    KVM: x86/mmu: Get nested MMU's root level from the MMU's role
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:

    - Changes to core scheduling facilities:

    - Add "Core Scheduling" via CONFIG_SCHED_CORE=y, which enables
    coordinated scheduling across SMT siblings. This is a much
    requested feature for cloud computing platforms, to allow the
    flexible utilization of SMT siblings, without exposing untrusted
    domains to information leaks & side channels, plus to ensure more
    deterministic computing performance on SMT systems used by
    heterogeneous workloads.

    There are new prctls to set core scheduling groups, which allows
    more flexible management of workloads that can share siblings.

    - Fix task->state access anti-patterns that may result in missed
    wakeups and rename it to ->__state in the process to catch new
    abuses.

    - Load-balancing changes:

    - Tweak newidle_balance for fair-sched, to improve 'memcache'-like
    workloads.

    - "Age" (decay) average idle time, to better track & improve
    workloads such as 'tbench'.

    - Fix & improve energy-aware (EAS) balancing logic & metrics.

    - Fix & improve the uclamp metrics.

    - Fix task migration (taskset) corner case on !CONFIG_CPUSET.

    - Fix RT and deadline utilization tracking across policy changes

    - Introduce a "burstable" CFS controller via cgroups, which allows
    bursty CPU-bound workloads to borrow a bit against their future
    quota to improve overall latencies & batching. Can be tweaked via
    /sys/fs/cgroup/cpu//cpu.cfs_burst_us.

    - Rework asymmetric topology/capacity detection & handling.

    - Scheduler statistics & tooling:

    - Disable delayacct by default, but add a sysctl to enable it at
    runtime if tooling needs it. Use static keys and other
    optimizations to make it more palatable.

    - Use sched_clock() in delayacct, instead of ktime_get_ns().

    - Misc cleanups and fixes.

    * tag 'sched-core-2021-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (72 commits)
    sched/doc: Update the CPU capacity asymmetry bits
    sched/topology: Rework CPU capacity asymmetry detection
    sched/core: Introduce SD_ASYM_CPUCAPACITY_FULL sched_domain flag
    psi: Fix race between psi_trigger_create/destroy
    sched/fair: Introduce the burstable CFS controller
    sched/uclamp: Fix uclamp_tg_restrict()
    sched/rt: Fix Deadline utilization tracking during policy change
    sched/rt: Fix RT utilization tracking during policy change
    sched: Change task_struct::state
    sched,arch: Remove unused TASK_STATE offsets
    sched,timer: Use __set_current_state()
    sched: Add get_current_state()
    sched,perf,kvm: Fix preemption condition
    sched: Introduce task_is_running()
    sched: Unbreak wakeups
    sched/fair: Age the average idle time
    sched/cpufreq: Consider reduced CPU capacity in energy calculation
    sched/fair: Take thermal pressure into account while estimating energy
    thermal/cpufreq_cooling: Update offline CPUs per-cpu thermal_pressure
    sched/fair: Return early from update_tg_cfs_load() if delta == 0
    ...

    Linus Torvalds
     

25 Jun, 2021

3 commits

  • To remove code duplication, use the binary stats descriptors in the
    implementation of the debugfs interface for statistics. This unifies
    the definition of statistics for the binary and debugfs interfaces.

    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     
  • Add a VCPU ioctl that returns a statistics file descriptor, through which
    userspace can read out the VCPU stats header, descriptors and data.
    Define VCPU statistics descriptors and header for all architectures.

    Reviewed-by: David Matlack
    Reviewed-by: Ricardo Koller
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Fuad Tabba
    Tested-by: Fuad Tabba #arm64
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     
  • Add a VM ioctl that returns a statistics file descriptor, through which
    userspace can read out the VM stats header, descriptors and data.
    Define VM statistics descriptors and header for all architectures.

    Reviewed-by: David Matlack
    Reviewed-by: Ricardo Koller
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Fuad Tabba
    Tested-by: Fuad Tabba #arm64
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     

24 Jun, 2021

3 commits

  • It's possible to create a region which maps valid but non-refcounted
    pages (e.g., tail pages of non-compound higher-order allocations). These
    host pages can then be returned by the gfn_to_page, gfn_to_pfn, etc.
    family of APIs, which take a reference to the page, taking its refcount
    from 0 to 1. When the reference is dropped, this will free the page
    incorrectly.

    Fix this by only taking a reference on a page if its refcount was already
    non-zero, which indicates it is participating in normal refcounting (and
    can be released with put_page); see the sketch after this entry.

    This addresses CVE-2021-22543.

    Signed-off-by: Nicholas Piggin
    Tested-by: Paolo Bonzini
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    Nicholas Piggin
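
    A sketch of the check described above, in the shape of a helper that
    pinning paths can call before relying on get_page()/put_page() semantics
    (name and placement are illustrative):

        static bool kvm_try_get_pfn(kvm_pfn_t pfn)
        {
            if (kvm_is_reserved_pfn(pfn))
                return true;

            /*
             * A refcount of zero means the page is not participating in
             * normal refcounting (e.g. a tail page of a non-compound
             * higher-order allocation); taking it from 0 to 1 and later
             * calling put_page() would free it incorrectly, so refuse to
             * pin it.
             */
            return get_page_unless_zero(pfn_to_page(pfn));
        }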
     
  • This commit defines the API for userspace and prepares the common
    functionality to support per-VM/VCPU binary stats data reads.

    KVM stats are currently only accessible through debugfs, which has some
    shortcomings that this series is meant to fix:
    1. The debugfs stats interface can be disabled when kernel Lockdown
    mode is enabled, which is a potential risk for production.
    2. The debugfs stats interface is organized as "one stat per file",
    which is good for debugging but not efficient for production.
    3. Stats read/clear in the debugfs interface are protected by the
    global kvm_lock.

    Besides that, there are some other benefits with this change:
    1. All KVM VM/VCPU stats can be read out in bulk with one copy to
    userspace.
    2. A schema is used to describe the KVM statistics. From userspace's
    perspective, the KVM statistics are self-describing.
    3. With the fd-based solution, a separate telemetry process is able to
    read KVM stats in a less privileged environment.
    4. After the initial setup of reading in the stats descriptors, a
    telemetry process only needs to read the stats data itself; no further
    parsing or setup is needed. (A rough userspace sketch follows this
    entry.)

    Reviewed-by: David Matlack
    Reviewed-by: Ricardo Koller
    Reviewed-by: Krish Sadhukhan
    Reviewed-by: Fuad Tabba
    Tested-by: Fuad Tabba #arm64
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
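
    The userspace sketch referenced above, assuming the KVM_GET_STATS_FD
    ioctl and the kvm_stats_header layout from the uapi headers; it treats
    every stat as a single u64 and skips descriptor parsing, so check the
    kernel documentation before relying on the exact offsets:

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/ioctl.h>
        #include <unistd.h>
        #include <linux/kvm.h>

        /* Read the stats header and raw data block from a VM or vCPU fd. */
        static int dump_kvm_stats(int kvm_obj_fd)
        {
            struct kvm_stats_header header;
            __u64 *data;
            int stats_fd;

            /* Same ioctl for VM and vCPU fds; it takes no argument. */
            stats_fd = ioctl(kvm_obj_fd, KVM_GET_STATS_FD, NULL);
            if (stats_fd < 0)
                return -1;

            /* One-time setup: the header (and, not shown, the descriptor
             * table starting at header.desc_offset). */
            if (pread(stats_fd, &header, sizeof(header), 0) != sizeof(header))
                goto err;

            /* Periodic collection: only the data block needs re-reading. */
            data = calloc(header.num_desc, sizeof(*data));
            if (!data)
                goto err;
            pread(stats_fd, data, header.num_desc * sizeof(*data),
                  header.data_offset);

            printf("first stat value: %llu\n", (unsigned long long)data[0]);

            free(data);
            close(stats_fd);
            return 0;
        err:
            close(stats_fd);
            return -1;
        }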
     
  • Generic KVM stats are those collected in architecture independent code
    or those supported by all architectures; put all generic statistics in
    a separate structure. This ensures that they are defined the same way
    in the statistics API which is being added, removing duplication among
    different architectures in the declaration of the descriptors.

    No functional change intended.

    Reviewed-by: David Matlack
    Reviewed-by: Ricardo Koller
    Reviewed-by: Krish Sadhukhan
    Signed-off-by: Jing Zhang
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jing Zhang
     

18 Jun, 2021

5 commits

  • When run from the sched-out path (preempt_notifier or perf_event),
    p->state is irrelevant for determining preemption: a task can get
    preempted with !task_is_running() just fine.

    The right indicator for preemption is if the task is still on the
    runqueue in the sched-out path.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mark Rutland
    Link: https://lore.kernel.org/r/20210611082838.285099381@infradead.org

    Peter Zijlstra
     
  • Make them the same type as vCPU stats. There is no reason
    to limit the counters to unsigned long.

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Add KVM PM-notifier so that architectures can have arch-specific
    VM suspend/resume routines. Such architectures need to select
    CONFIG_HAVE_KVM_PM_NOTIFIER and implement kvm_arch_pm_notifier().

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Marc Zyngier
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sergey Senozhatsky
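
    A hedged sketch of what an arch-side implementation could look like; the
    hook name comes from the entry above, while the per-state actions are
    placeholders:

        #include <linux/suspend.h>

        /* Selected via CONFIG_HAVE_KVM_PM_NOTIFIER by interested arches. */
        int kvm_arch_pm_notifier(struct kvm *kvm, unsigned long state)
        {
            switch (state) {
            case PM_HIBERNATION_PREPARE:
            case PM_SUSPEND_PREPARE:
                /* e.g. save clocks or quiesce vCPUs before the host sleeps */
                break;
            case PM_POST_HIBERNATION:
            case PM_POST_SUSPEND:
                /* undo the above after resume */
                break;
            }
            return NOTIFY_DONE;
        }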
     
  • Add a new lock to protect the arch-specific fields of memslots if they
    need to be modified in a kvm->srcu read critical section. A future
    commit will use this lock to lazily allocate memslot rmaps for x86.

    Signed-off-by: Ben Gardon
    Message-Id:
    [Add Documentation/ hunk. - Paolo]
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     
  • Factor out copying kvm_memslots from allocating the memory for new ones
    in preparation for adding a new lock to protect the arch-specific fields
    of the memslots.

    No functional change intended.

    Reviewed-by: David Hildenbrand
    Signed-off-by: Ben Gardon
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     

27 May, 2021

3 commits

  • For VMX, when a vcpu enters HLT emulation, pi_post_block will:

    1) Add vcpu to per-cpu list of blocked vcpus.

    2) Program the posted-interrupt descriptor "notification vector"
    to POSTED_INTR_WAKEUP_VECTOR

    With interrupt remapping, an interrupt will set the PIR bit for the
    vector programmed for the device on the CPU, test-and-set the
    ON bit on the posted interrupt descriptor, and if the ON bit is clear
    generate an interrupt for the notification vector.

    This way, the target CPU wakes upon a device interrupt and wakes up
    the target vcpu.

    The problem is that pi_post_block only programs the notification vector
    if kvm_arch_has_assigned_device() is true. It's possible for the
    following to happen:

    1) vcpu V HLTs on pcpu P, kvm_arch_has_assigned_device is false,
    notification vector is not programmed
    2) device is assigned to VM
    3) device interrupts vcpu V, sets ON bit
    (notification vector not programmed, so pcpu P remains in idle)
    4) vcpu 0 IPIs vcpu V (in guest), but since pi descriptor ON bit is set,
    kvm_vcpu_kick is skipped
    5) vcpu 0 busy spins on vcpu V's response for several seconds, until
    RCU watchdog NMIs all vCPUs.

    To fix this, use the start_assignment kvm_x86_ops callback to kick
    vcpus out of the halt loop, so the notification vector is
    properly reprogrammed to the wakeup vector.

    Reported-by: Pei Zhang
    Signed-off-by: Marcelo Tosatti
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Marcelo Tosatti
     
  • KVM_REQ_UNBLOCK will be used to exit a vcpu from
    its inner vcpu halt emulation loop.

    Rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK, switch
    PowerPC to arch specific request bit.

    Signed-off-by: Marcelo Tosatti

    Message-Id:
    Signed-off-by: Paolo Bonzini

    Marcelo Tosatti
     
  • This is inspired by commit 262de4102c7bb8 ("kvm: exit halt polling on
    need_resched() as well"). Because PPC implements its own arch-specific
    halt-polling logic, the need_resched() check has to be added there as
    well. This patch adds a helper function that can be shared between the
    book3s and generic halt-polling loops (a sketch of the helper follows
    this entry).

    Reviewed-by: David Matlack
    Reviewed-by: Venkatesh Srinivas
    Cc: Ben Segall
    Cc: Venkatesh Srinivas
    Cc: Jim Mattson
    Cc: David Matlack
    Cc: Paul Mackerras
    Cc: Suraj Jitindar Singh
    Signed-off-by: Wanpeng Li
    Message-Id:
    [Make the function inline. - Paolo]
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
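
    Roughly, the shared helper looks like this (a sketch; the exact name and
    location may differ):

        /*
         * Keep polling only while nothing else wants this CPU and the
         * polling window has not yet expired; usable from both the generic
         * halt-polling loop and book3s's own loop.
         */
        static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
        {
            return single_task_running() && !need_resched() &&
                   ktime_before(cur, stop);
        }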
     

17 May, 2021

1 commit


15 May, 2021

1 commit

  • This reverts commit a979a6aa009f3c99689432e0cdb5402a4463fb88.

    The reverted commit may cause VM freeze on arm64 with GICv4,
    where stopping a consumer is implemented by suspending the VM.
    Should the connect fail, the VM will not be resumed, which
    is a bit of a problem.

    It also erroneously calls the producer destructor unconditionally,
    which is unexpected.

    Reported-by: Shaokun Zhang
    Suggested-by: Marc Zyngier
    Acked-by: Jason Wang
    Acked-by: Michael S. Tsirkin
    Reviewed-by: Eric Auger
    Tested-by: Shaokun Zhang
    Signed-off-by: Zhu Lingshan
    [maz: tags and cc-stable, commit message update]
    Signed-off-by: Marc Zyngier
    Fixes: a979a6aa009f ("irqbypass: do not start cons/prod when failed connect")
    Link: https://lore.kernel.org/r/3a2c66d6-6ca0-8478-d24b-61e8e3241b20@hisilicon.com
    Link: https://lore.kernel.org/r/20210508071152.722425-1-lingshan.zhu@intel.com
    Cc: stable@vger.kernel.org

    Zhu Lingshan
     

07 May, 2021

1 commit

  • When growing halt-polling, there is no check that the poll time exceeds
    the per-VM limit. It's possible for vcpu->halt_poll_ns to grow past
    kvm->max_halt_poll_ns and stay there until a halt which takes longer
    than kvm->halt_poll_ns.

    Signed-off-by: David Matlack
    Signed-off-by: Venkatesh Srinivas
    Message-Id:
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    David Matlack
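
    A sketch of the growth path with the missing clamp added; module
    parameter and trace names follow kvm_main.c but are incidental here:

        static void grow_halt_poll_ns(struct kvm_vcpu *vcpu)
        {
            unsigned int old, val, grow, grow_start;

            old = val = vcpu->halt_poll_ns;
            grow_start = READ_ONCE(halt_poll_ns_grow_start);
            grow = READ_ONCE(halt_poll_ns_grow);
            if (!grow)
                goto out;

            val *= grow;
            if (val < grow_start)
                val = grow_start;

            /* The missing check: never grow past the per-VM limit. */
            if (val > vcpu->kvm->max_halt_poll_ns)
                val = vcpu->kvm->max_halt_poll_ns;

            vcpu->halt_poll_ns = val;
        out:
            trace_kvm_halt_poll_ns_grow(vcpu->vcpu_id, val, old);
        }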
     

03 May, 2021

1 commit

  • single_task_running() is usually a more general check than
    need_resched(), but CFS_BANDWIDTH throttling will use resched_task()
    when there is just one task, in order to get that task to block. This
    was causing long need_resched warnings and was likely allowing VMs to
    overrun their quota when halt polling.

    Signed-off-by: Ben Segall
    Signed-off-by: Venkatesh Srinivas
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Cc: stable@vger.kernel.org
    Reviewed-by: Jim Mattson

    Benjamin Segall
     

22 Apr, 2021

2 commits

  • Both the lock holder vCPU and an IPI receiver that has halted are
    candidates for a boost. However, the PLE handler was originally designed
    to deal with the lock holder preemption problem: Intel PLE occurs when
    the spinlock waiter is in kernel mode. This assumption doesn't hold for
    IPI receivers, which can be in either kernel or user mode; a candidate
    vCPU in user mode will not be boosted even if it should respond to IPIs.
    Some benchmarks like pbzip2, swaptions etc. do the TLB shootdown in
    kernel mode while running in user mode most of the time. This can lead
    to a large number of continuous PLE events, because the IPI sender
    causes PLE events repeatedly until the receiver is scheduled, while the
    receiver is not a candidate for a boost.

    This patch boosts vCPU candidates that are in user mode and are being
    delivered an interrupt. We observe that pbzip2 speeds up by 10% in a
    96-vCPU VM in an over-subscription scenario (the host machine is a
    2-socket, 48-core, 96-HT Intel CLX box). There is no performance
    regression for other benchmarks like Unixbench spawn (which mostly
    contends read/write locks in kernel mode) or ebizzy (which mostly
    contends read/write semaphores and does TLB shootdowns in kernel mode).

    Signed-off-by: Wanpeng Li
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     
  • Add a capability for userspace to mirror SEV encryption context from
    one vm to another. On our side, this is intended to support a
    Migration Helper vCPU, but it can also be used generically to support
    other in-guest workloads scheduled by the host. The intention is for
    the primary guest and the mirror to have nearly identical memslots.

    The primary benefits of this are that:
    1) The VMs do not share KVM contexts (think APIC/MSRs/etc), so they
    can't accidentally clobber each other.
    2) The VMs can have different memory-views, which is necessary for post-copy
    migration (the migration vCPUs on the target need to read and write to
    pages, when the primary guest would VMEXIT).

    This does not change the threat model for AMD SEV. Any memory involved
    is still owned by the primary guest and its initial state is still
    attested to through the normal SEV_LAUNCH_* flows. If userspace wanted
    to circumvent SEV, they could achieve the same effect by simply attaching
    a vCPU to the primary VM.
    This patch deliberately leaves userspace in charge of the memslots for the
    mirror, as it already has the power to mess with them in the primary guest.

    This patch does not support SEV-ES (much less SNP), as it does not
    handle handing off attested VMSAs to the mirror.

    For additional context, we need a Migration Helper because SEV PSP
    migration is far too slow for our live migration on its own. Using
    an in-guest migrator lets us speed this up significantly.

    Signed-off-by: Nathan Tempelman
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Nathan Tempelman
     

20 Apr, 2021

3 commits

  • Convert a comment above kvm_io_bus_unregister_dev() into an actual
    lockdep assertion, and opportunistically add curly braces to a multi-line
    for-loop.

    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Abort the walk of coalesced MMIO zones if kvm_io_bus_unregister_dev()
    fails to allocate memory for the new instance of the bus. If it can't
    instantiate a new bus, unregister_dev() destroys all devices _except_ the
    target device. But, it doesn't tell the caller that it obliterated the
    bus and invoked the destructor for all devices that were on the bus. In
    the coalesced MMIO case, this can result in a deleted list entry
    dereference due to attempting to continue iterating on coalesced_zones
    after future entries (in the walk) have been deleted.

    Opportunistically add curly braces to the for-loop, which encompasses
    many lines but sneaks by without braces due to the guts being a single
    if statement.

    Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
    Cc: stable@vger.kernel.org
    Reported-by: Hao Sun
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • If allocating a new instance of an I/O bus fails when unregistering a
    device, wait to destroy the device until after all readers are guaranteed
    to see the new null bus. Destroying devices before the bus is nullified
    could lead to use-after-free since readers expect the devices on their
    reference of the bus to remain valid.

    Fixes: f65886606c2d ("KVM: fix memory leak in kvm_io_bus_unregister_dev()")
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

17 Apr, 2021

6 commits

  • Defer acquiring mmu_lock in the MMU notifier paths until a "hit" has been
    detected in the memslots, i.e. don't take the lock for notifications that
    don't affect the guest.

    For small VMs, spurious locking is a minor annoyance. And for "volatile"
    setups where the majority of notifications _are_ relevant, this barely
    qualifies as an optimization.

    But, for large VMs (hundreds of threads) with static setups, e.g. no
    page migration, no swapping, etc..., the vast majority of MMU notifier
    callbacks will be unrelated to the guest, e.g. will often be in response
    to the userspace VMM adjusting its own virtual address space. In such
    large VMs, acquiring mmu_lock can be painful as it blocks vCPUs from
    handling page faults. In some scenarios it can even be "fatal" in the
    sense that it causes unacceptable brownouts, e.g. when rebuilding huge
    pages after live migration, a significant percentage of vCPUs will be
    attempting to handle page faults.

    x86's TDP MMU implementation is especially susceptible to spurious
    locking due to it taking mmu_lock for read when handling page faults.
    Because rwlock is fair, a single writer will stall future readers, while
    the writer is itself stalled waiting for in-progress readers to complete.
    This is exacerbated by the MMU notifiers often firing multiple times in
    quick succession, e.g. moving a page will (always?) invoke three separate
    notifiers: .invalidate_range_start(), .invalidate_range_end(), and
    .change_pte(). Unnecessarily taking mmu_lock each time means even a
    single spurious sequence can be problematic.

    Note, this optimizes only the unpaired callbacks. Optimizing the
    .invalidate_range_{start,end}() pairs is more complex and will be done in
    a future patch.

    Suggested-by: Ben Gardon
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Acquire and release mmu_lock in the __kvm_handle_hva_range() helper
    instead of requiring the caller to do the same. This paves the way for
    future patches to take mmu_lock if and only if an overlapping memslot is
    found, without also having to introduce the on_lock() shenanigans used
    to manipulate the notifier count and sequence.

    No functional change intended.

    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Yank out the hva-based MMU notifier APIs now that all architectures that
    use the notifiers have moved to the gfn-based APIs.

    No functional change intended.

    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Move the hva->gfn lookup for MMU notifiers into common code. Every arch
    does a similar lookup, and some arch code is all but identical across
    multiple architectures.

    In addition to consolidating code, this will allow introducing
    optimizations that will benefit all architectures without incurring
    multiple walks of the memslots, e.g. by taking mmu_lock if and only if a
    relevant range exists in the memslots.

    The use of __always_inline to avoid indirect call retpolines, as done by
    x86, may also benefit other architectures.

    Consolidating the lookups also fixes a wart in x86, where the legacy MMU
    and TDP MMU each do their own memslot walks.

    Lastly, future enhancements to the memslot implementation, e.g. to add an
    interval tree to track host address, will need to touch far less arch
    specific code.

    MIPS, PPC, and arm64 will be converted one at a time in future patches.

    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
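
    To make the consolidated lookup concrete, a sketch of the per-memslot
    hva-to-gfn clamping the common code performs; hva_range_to_gfn_range is
    a hypothetical helper name, while hva_to_gfn_memslot and struct
    kvm_gfn_range follow the series:

        static void hva_range_to_gfn_range(struct kvm_memory_slot *slot,
                                           unsigned long start,
                                           unsigned long end,
                                           struct kvm_gfn_range *range)
        {
            unsigned long hva_start, hva_end;

            /* Clamp the notified hva range to the part backed by this slot. */
            hva_start = max(start, slot->userspace_addr);
            hva_end = min(end, slot->userspace_addr +
                               (slot->npages << PAGE_SHIFT));

            /*
             * The gfn range is exclusive at the end; round up so every page
             * overlapping the hva range is covered.
             */
            range->start = hva_to_gfn_memslot(hva_start, slot);
            range->end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
            range->slot = slot;
        }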
     
  • In KVM's .change_pte() notification callback, replace the notifier
    sequence bump with a WARN_ON assertion that the notifier count is
    elevated. An elevated count provides stricter protections than bumping
    the sequence, and the sequence is guaranteed to be bumped before the
    count hits zero.

    When .change_pte() was added by commit 828502d30073 ("ksm: add
    mmu_notifier set_pte_at_notify()"), bumping the sequence was necessary
    as .change_pte() would be invoked without any surrounding notifications.

    However, since commit 6bdb913f0a70 ("mm: wrap calls to set_pte_at_notify
    with invalidate_range_start and invalidate_range_end"), all calls to
    .change_pte() are guaranteed to be surrounded by start() and end(), and
    so are guaranteed to run with an elevated notifier count.

    Note, wrapping .change_pte() with .invalidate_range_{start,end}() is a
    bug of sorts, as invalidating the secondary MMU's (KVM's) PTE defeats
    the purpose of .change_pte(). Every arch's kvm_set_spte_hva() assumes
    .change_pte() is called when the relevant SPTE is present in KVM's MMU,
    as the original goal was to accelerate Kernel Samepage Merging (KSM) by
    updating KVM's SPTEs without requiring a VM-Exit (due to invalidating
    the SPTE). I.e. it means that .change_pte() is effectively dead code
    on _all_ architectures.

    x86 and MIPS are clearcut nops if the old SPTE is not-present, and that
    is guaranteed due to the prior invalidation. PPC simply unmaps the SPTE,
    which again should be a nop due to the invalidation. arm64 is a bit
    murky, but it's also likely a nop because kvm_pgtable_stage2_map() is
    called without a cache pointer, which means it will map an entry if and
    only if an existing PTE was found.

    For now, take advantage of the bug to simplify future consolidation of
    KVMs's MMU notifier code. Doing so will not greatly complicate fixing
    .change_pte(), assuming it's even worth fixing. .change_pte() has been
    broken for 8+ years and no one has complained. Even if there are
    KSM+KVM users that care deeply about its performance, the benefits of
    avoiding VM-Exits via .change_pte() need to be reevaluated to justify
    the added complexity and testing burden. Ripping out .change_pte()
    entirely would be a lot easier.

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Use GFP_KERNEL_ACCOUNT when allocating vCPUs to make it more obvious
    that the allocations are accounted, to make it easier to audit KVM's
    allocations in the future, and to be consistent with other cache usage in
    KVM.

    When using SLAB/SLUB, this is a nop as the cache itself is created with
    SLAB_ACCOUNT.

    When using SLOB, there are caveats within caveats. SLOB doesn't honor
    SLAB_ACCOUNT, so passing GFP_KERNEL_ACCOUNT will result in vCPU
    allocations now being accounted. But, even that depends on internal
    SLOB details as SLOB will only go to the page allocator when its cache is
    depleted. That just happens to be extremely likely for vCPUs because the
    size of kvm_vcpu is larger than a page for almost all combinations of
    architecture and page size. Whether or not the SLOB behavior is by
    design is unknown; it's just as likely that no SLOB users care about
    accounting and so no one has bothered to implement support in SLOB.
    Regardless, accounting vCPU allocations will not break SLOB+KVM+cgroup
    users, if any exist.

    Reviewed-by: Wanpeng Li
    Acked-by: David Rientjes
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
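
    For illustration, the change boils down to the allocation flag at the
    vCPU creation site (a sketch, assuming the kvm_vcpu_cache slab used by
    common code):

        struct kvm_vcpu *vcpu;

        /*
         * GFP_KERNEL_ACCOUNT makes the cgroup accounting explicit at the
         * call site; with SLAB/SLUB the cache already has SLAB_ACCOUNT, so
         * this is effectively a no-op there.
         */
        vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL_ACCOUNT);
        if (!vcpu)
            return -ENOMEM;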