08 Nov, 2020

1 commit

  • The recent introduction of userspace MSR filtering added code that uses
    negative error codes for cases that result either in #GP delivery to
    the guest or in handling by the userspace MSR filter.

    This breaks the assumption that a negative error code returned from the
    MSR emulation code is a semi-fatal error which should be returned to
    userspace via the KVM_RUN ioctl and which usually kills the guest.

    Fix this by reusing the existing KVM_MSR_RET_INVALID error code and by
    adding a new KVM_MSR_RET_FILTERED error code for the userspace-filtered
    MSRs.
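
    As a rough sketch of the distinction (the constant values match x86.h;
    the kvm_msr_allowed()-based wrapper below is illustrative, not the
    exact upstream diff):

        #define KVM_MSR_RET_INVALID  2  /* in-kernel MSR emulation #GP condition */
        #define KVM_MSR_RET_FILTERED 3  /* #GP due to userspace MSR filtering */

        static int kvm_set_msr_filtered(struct kvm_vcpu *vcpu, u32 index, u64 data)
        {
                if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE))
                        return KVM_MSR_RET_FILTERED; /* not a semi-fatal negative code */

                return __kvm_set_msr(vcpu, index, data, false);
        }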

    Fixes: 291f35fb2c1d1 ("KVM: x86: report negative values from wrmsr emulation to userspace")
    Reported-by: Qian Cai
    Signed-off-by: Maxim Levitsky
    Signed-off-by: Paolo Bonzini

    Maxim Levitsky
     

09 Jul, 2020

6 commits

  • To avoid complex and in some cases incorrect logic in
    kvm_spec_ctrl_test_value, just try the guest's given value on the host
    processor instead, and if it doesn't #GP, allow the guest to set it.

    One such case is when the host CPU supports the STIBP mitigation but
    doesn't support IBRS (as is the case with some AMD Zen 2 CPUs); in this
    case the guest was given a #GP when it tried to use STIBP.

    The reason we can do the host test is that the IA32_SPEC_CTRL MSR is
    passed through to the guest after the guest sets it to a non-zero value
    for the first time (for performance reasons), so it is pointless to
    emulate the #GP condition on this first access differently from what
    the host CPU does.

    This is based on a patch from Sean Christopherson, who suggested this idea.
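
    A minimal sketch of the host-test approach (close to the final
    kvm_spec_ctrl_test_value(), though details may differ):

        int kvm_spec_ctrl_test_value(u64 value)
        {
                u64 saved_value;
                unsigned long flags;
                int ret = 0;

                local_irq_save(flags);

                /* A failed write means the host would #GP on this value. */
                if (rdmsrl_safe(MSR_IA32_SPEC_CTRL, &saved_value))
                        ret = 1;
                else if (wrmsrl_safe(MSR_IA32_SPEC_CTRL, value))
                        ret = 1;
                else
                        wrmsrl(MSR_IA32_SPEC_CTRL, saved_value);

                local_irq_restore(flags);

                return ret;
        }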

    Fixes: 6441fa6178f5 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL")
    Cc: stable@vger.kernel.org
    Suggested-by: Sean Christopherson
    Signed-off-by: Maxim Levitsky
    Signed-off-by: Paolo Bonzini

    Maxim Levitsky
     
  • According to section "Canonicalization and Consistency Checks" in APM vol. 2
    the following guest state is illegal:

    "Any MBZ bit of CR3 is set."
    "Any MBZ bit of CR4 is set."

    Suggested-by: Paolo Bonzini
    Signed-off-by: Krish Sadhukhan
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • CR4.VMXE is reserved unless the VMX CPUID bit is set. On Intel,
    it is also tested by vmx_set_cr4, but AMD relies on kvm_valid_cr4,
    so fix it.
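
    A sketch of the check, assuming it lives in the common kvm_valid_cr4()
    path so that both Intel and AMD inherit it:

        static int kvm_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
        {
                if (cr4 & vcpu->arch.cr4_guest_rsvd_bits)
                        return -EINVAL;

                /* VMXE is reserved unless CPUID exposes VMX to the guest. */
                if (!guest_cpuid_has(vcpu, X86_FEATURE_VMX) && (cr4 & X86_CR4_VMXE))
                        return -EINVAL;

                return 0;
        }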

    Reviewed-by: Jim Mattson
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Instead of creating the mask for guest CR4 reserved bits in kvm_valid_cr4(),
    do it in kvm_update_cpuid() so that it can be reused instead of creating it
    each time kvm_valid_cr4() is called.
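
    A sketch of the precomputation; the helper is paraphrased, and the full
    list of CPUID-gated CR4 bits is abbreviated:

        static u64 kvm_calc_cr4_reserved_bits(struct kvm_vcpu *vcpu)
        {
                u64 reserved = CR4_RESERVED_BITS;

                if (!guest_cpuid_has(vcpu, X86_FEATURE_SMEP))
                        reserved |= X86_CR4_SMEP;
                if (!guest_cpuid_has(vcpu, X86_FEATURE_SMAP))
                        reserved |= X86_CR4_SMAP;
                if (!guest_cpuid_has(vcpu, X86_FEATURE_PKU))
                        reserved |= X86_CR4_PKE;
                /* ... likewise for the remaining CPUID-gated CR4 bits ... */

                return reserved;
        }

        /* in kvm_update_cpuid(): */
        vcpu->arch.cr4_guest_rsvd_bits = kvm_calc_cr4_reserved_bits(vcpu);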

    Suggested-by: Paolo Bonzini
    Signed-off-by: Krish Sadhukhan
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • Signed-off-by: Krish Sadhukhan
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • MSR accesses can be one of:

    (1) KVM internal access,
    (2) userspace access (e.g., via KVM_SET_MSRS ioctl),
    (3) guest access.

    ignore_msrs was previously handled by kvm_get_msr_common() and
    kvm_set_msr_common(), which sit at the bottom of the MSR access stack.
    That works in most cases, but it could dump unwanted warning messages
    to dmesg even when KVM gets/sets MSRs internally via __kvm_set_msr() or
    __kvm_get_msr() (e.g. from kvm_cpuid()). Ideally we only want to trap
    cases (2) and (3) above, not (1).

    To achieve this, move the ignore_msrs handling up into the callers of
    __kvm_get_msr() and __kvm_set_msr(). To identify the "MSR missing"
    event, a new return value (KVM_MSR_RET_INVALID==2) is used.
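
    A sketch of the resulting wrapper that only the userspace and guest
    entry points go through (close to the kvm_msr_ignored_check() shape,
    simplified here):

        static int kvm_msr_ignored_check(struct kvm_vcpu *vcpu, u32 msr,
                                         u64 data, bool write)
        {
                if (ignore_msrs) {
                        if (report_ignored_msrs)
                                vcpu_unimpl(vcpu, "ignored %s: 0x%x data 0x%llx\n",
                                            write ? "wrmsr" : "rdmsr", msr, data);
                        /* Mask the error: pretend the access succeeded. */
                        return 0;
                }

                return KVM_MSR_RET_INVALID;
        }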

    Signed-off-by: Peter Xu
    Signed-off-by: Paolo Bonzini

    Peter Xu
     

16 May, 2020

2 commits

  • Add a fastpath_t typedef, since the enum lines are a bit long, and
    replace EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion
    enum values:

    - EXIT_FASTPATH_EXIT_HANDLED: KVM still goes through its full run loop,
      but skips invoking the exit handler.

    - EXIT_FASTPATH_REENTER_GUEST: complete fastpath; the guest can be
      re-entered without invoking the exit handler or going back to
      vcpu_run.
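
    The resulting type, roughly as it reads after this change:

        enum exit_fastpath_completion {
                EXIT_FASTPATH_NONE,
                EXIT_FASTPATH_REENTER_GUEST,
                EXIT_FASTPATH_EXIT_HANDLED,
        };
        typedef enum exit_fastpath_completion fastpath_t;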

    Tested-by: Haiwei Li
    Cc: Haiwei Li
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     
  • Introduce the kvm_vcpu_exit_request() helper: we need to check some
    conditions before re-entering the guest immediately. If the fastpath
    completes but something prevents immediate re-entry, we skip invoking
    the exit handler and go through the full run loop instead.
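
    A sketch of the helper; upstream's version checks essentially these
    conditions:

        static inline bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
        {
                return vcpu->mode == EXITING_GUEST_MODE ||
                       kvm_request_pending(vcpu) ||
                       need_resched() ||
                       signal_pending(current);
        }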

    Tested-by: Haiwei Li
    Cc: Haiwei Li
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     

21 Apr, 2020

1 commit

  • Add KVM_REQ_TLB_FLUSH_CURRENT to allow optimized TLB flushing of VMX's
    EPTP/VPID contexts[*] from the KVM MMU and/or in a deferred manner, e.g.
    to flush L2's context during nested VM-Enter.

    Convert KVM_REQ_TLB_FLUSH to KVM_REQ_TLB_FLUSH_CURRENT in flows where
    the flush is directly associated with vCPU-scoped instruction emulation,
    i.e. MOV CR3 and INVPCID.

    Add a comment in vmx_vcpu_load_vmcs() above its KVM_REQ_TLB_FLUSH to
    make it clear that it deliberately requests a flush of all contexts.

    Service any pending flush request on nested VM-Exit as it's possible a
    nested VM-Exit could occur after requesting a flush for L2. Add the
    same logic for nested VM-Enter even though it's _extremely_ unlikely
    for flush to be pending on nested VM-Enter, but theoretically possible
    (in the future) due to RSM (SMM) emulation.

    [*] Intel also has an Address Space Identifier (ASID) concept, e.g.
    EPTP+VPID+PCID == ASID, it's just not documented in the SDM because
    the rules of invalidation are different based on which piece of the
    ASID is being changed, i.e. whether the EPTP, VPID, or PCID context
    must be invalidated.
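
    A rough sketch of how the two requests are then serviced in
    vcpu_enter_guest() (helper names follow the series):

        if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
                kvm_vcpu_flush_tlb_all(vcpu);      /* all contexts */
        if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
                kvm_vcpu_flush_tlb_current(vcpu);  /* current EPTP/VPID context only */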

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

31 Mar, 2020

1 commit

  • Replace the kvm_x86_ops pointer in common x86 with an instance of the
    struct to save one pointer dereference when invoking functions. Copy the
    struct by value to set the ops during kvm_init().

    Arbitrarily use kvm_x86_ops.hardware_enable to track whether or not the
    ops have been initialized, i.e. a vendor KVM module has been loaded.
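
    A sketch of the shape of the change; the init-ops field name is
    approximate:

        struct kvm_x86_ops kvm_x86_ops __read_mostly;
        EXPORT_SYMBOL_GPL(kvm_x86_ops);

        /* during kvm_arch_hardware_setup(), sketched: */
        kvm_x86_ops = *ops->runtime_ops;   /* copy by value: one less dereference */

        /* "has a vendor module been loaded?" then reads: */
        if (kvm_x86_ops.hardware_enable) {
                /* ops are initialized */
        }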

    Suggested-by: Paolo Bonzini
    Signed-off-by: Sean Christopherson
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

17 Mar, 2020

5 commits

  • Current CPUID 0xd enumeration code does not support supervisor
    states, because KVM only supports setting IA32_XSS to zero.
    Change it instead to use a new variable supported_xss, to be
    set from the hardware_setup callback which is in charge of CPU
    capabilities.
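
    A minimal sketch, assuming the variable lives in common x86 code and
    vendor code clears it while no supervisor states are supported:

        u64 __read_mostly supported_xss;
        EXPORT_SYMBOL_GPL(supported_xss);

        /* in the vendor hardware_setup(), sketched: */
        supported_xss = 0;   /* KVM supports no supervisor states yet */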

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Expose kvm_mpx_supported() as a static inline so that it can be inlined
    in kvm_intel.ko.

    No functional change intended.
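
    The helper, roughly as it reads once it is inlined into x86.h:

        static inline bool kvm_mpx_supported(void)
        {
                return (supported_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR))
                        == (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
        }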

    Reviewed-by: Xiaoyao Li
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Add a new global variable, supported_xcr0, to track which xcr0 bits can
    be exposed to the guest instead of calculating the mask on every call.
    The supported bits are constant for a given instance of KVM.

    This paves the way toward eliminating the ->mpx_supported() call in
    kvm_mpx_supported(), e.g. eliminates multiple retpolines in VMX's nested
    VM-Enter path, and eventually toward eliminating ->mpx_supported()
    altogether.

    No functional change intended.
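
    A sketch of the setup, computed once during hardware setup (placement
    approximate):

        u64 __read_mostly supported_xcr0;
        EXPORT_SYMBOL_GPL(supported_xcr0);

        if (boot_cpu_has(X86_FEATURE_XSAVE)) {
                host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
                supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
        }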

    Reviewed-by: Xiaoyao Li
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Now that the emulation context is dynamically allocated and not embedded
    in struct kvm_vcpu, move its header, kvm_emulate.h, out of the public
    asm directory and into KVM's private x86 directory.

    Signed-off-by: Sean Christopherson
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Move ctxt_virt_addr_bits() and emul_is_noncanonical_address() from x86.h
    to emulate.c. This eliminates all references to struct x86_emulate_ctxt
    from x86.h, and sets the stage for a future patch to stop including
    kvm_emulate.h in asm/kvm_host.h.
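
    The two helpers, roughly as they land in emulate.c:

        static inline u8 ctxt_virt_addr_bits(struct x86_emulate_ctxt *ctxt)
        {
                return (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_LA57) ? 57 : 48;
        }

        static inline bool emul_is_noncanonical_address(u64 la,
                                                        struct x86_emulate_ctxt *ctxt)
        {
                return get_canonical(la, ctxt_virt_addr_bits(ctxt)) != la;
        }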

    Signed-off-by: Sean Christopherson
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

05 Feb, 2020

1 commit

  • Take a u64 instead of an unsigned long in kvm_dr7_valid() to fix a build
    warning on i386 due to right-shifting a 32-bit value by 32 when checking
    for bits being set in dr7[63:32].

    Alternatively, the warning could be resolved by rewriting the check to
    use an i386-friendly method, but taking a u64 fixes another oddity on
    32-bit KVM. Because KVM implements natural width VMCS fields as u64s to
    avoid layout issues between 32-bit and 64-bit, a devious guest can stuff
    vmcs12->guest_dr7 with a 64-bit value even when both the guest and host
    are 32-bit kernels. KVM eventually drops vmcs12->guest_dr7[63:32] when
    propagating vmcs12->guest_dr7 to vmcs02, but ideally KVM would not rely
    on that behavior for correctness.
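
    The resulting helper in x86.h:

        static inline bool kvm_dr7_valid(u64 data)
        {
                /* Bits 63:32 are reserved */
                return !(data >> 32);
        }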

    Cc: Jim Mattson
    Cc: Krish Sadhukhan
    Fixes: ecb697d10f70 ("KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests")
    Reported-by: Randy Dunlap
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

28 Jan, 2020

2 commits

  • According to section "Checks on Guest Control Registers, Debug Registers,
    and MSRs" in Intel SDM vol 3C, the following checks are performed on vmentry
    of nested guests:

    If the "load debug controls" VM-entry control is 1, bits 63:32 in the DR7
    field must be 0.

    In KVM, GUEST_DR7 is set prior to the vmcs02 VM-entry by kvm_set_dr() and the
    latter synthesizes a #GP if any bit in the high dword in the former is set.
    Hence this field needs to be checked in software.
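
    A sketch of the added consistency check in nested_vmx_check_guest_state(),
    written here in terms of the kvm_dr7_valid() helper from the later commit
    above:

        if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) &&
            !kvm_dr7_valid(vmcs12->guest_dr7))
                return -EINVAL;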

    Signed-off-by: Krish Sadhukhan
    Reviewed-by: Karl Heubaum
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • Remove the CONFIG_X86_64 condition from the low level non-canonical
    helpers to effectively enable non-canonical checks on 32-bit KVM.
    Non-canonical checks are performed by hardware if the CPU *supports*
    64-bit mode, whether or not the CPU is actually in 64-bit mode is
    irrelevant.

    For the most part, skipping non-canonical checks on 32-bit KVM is ok-ish
    because 32-bit KVM always (hopefully) drops bits 63:32 of whatever value
    it's checking before propagating it to hardware, and architecturally,
    the expected behavior for the guest is a bit of a grey area since the
    vCPU itself doesn't support 64-bit mode. I.e. a 32-bit KVM guest can
    observe the missed checks in several paths, e.g. INVVPID and VM-Enter,
    but it's debatable whether or not the missed checks constitute a bug
    because technically the vCPU doesn't support 64-bit mode.

    The primary motivation for enabling the non-canonical checks is defense
    in depth. As mentioned above, a guest can trigger a missed check via
    INVVPID or VM-Enter. INVVPID is straightforward as it takes a 64-bit
    virtual address as part of its 128-bit INVVPID descriptor and fails if
    the address is non-canonical, even if INVVPID is executed in 32-bit PM.
    Nested VM-Enter is a bit more convoluted as it requires the guest to
    write natural width VMCS fields via memory accesses and then VMPTRLD the
    VMCS, but it's still possible. In both cases, KVM is saved from a true
    bug only because its flows that propagate values to hardware (correctly)
    take "unsigned long" parameters and so drop bits 63:32 of the bad value.

    Explicitly performing the non-canonical checks makes it less likely that
    a bad value will be propagated to hardware, e.g. in the INVVPID case,
    if __invvpid() didn't implicitly drop bits 63:32 then KVM would BUG() on
    the resulting unexpected INVVPID failure due to hardware rejecting the
    non-canonical address.

    The only downside to enabling the non-canonical checks is that it adds a
    relatively small amount of overhead, but the affected flows are not hot
    paths, i.e. the overhead is negligible.

    Note, KVM technically could gate the non-canonical checks on 32-bit KVM
    with static_cpu_has(X86_FEATURE_LM), but on bare metal that's an even
    bigger waste of code for everyone except the 0.00000000000001% of the
    population running on Yonah, and nested 32-bit on 64-bit already fudges
    things with respect to 64-bit CPU behavior.
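
    The low level helper, with the CONFIG_X86_64 guard removed (the #else
    arm used to return false on 32-bit builds):

        static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu)
        {
                return get_canonical(la, vcpu_virt_addr_bits(vcpu)) != la;
        }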

    Signed-off-by: Sean Christopherson
    [Also do so in nested_vmx_check_host_state as reported by Krish. - Paolo]
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

24 Jan, 2020

1 commit

  • If the guest is configured to have SPEC_CTRL but the host does not
    (which is a nonsensical configuration but these are not explicitly
    forbidden) then a host-initiated MSR write can write vmx->spec_ctrl
    (respectively svm->spec_ctrl) and trigger a #GP when KVM tries to
    restore the host value of the MSR. Add a more comprehensive check
    for valid bits of SPEC_CTRL, covering host CPUID flags and,
    since we are at it and it is more correct that way, guest CPUID
    flags too.

    For AMD, remove the unnecessary is_guest_mode check around setting
    the MSR interception bitmap, so that the code looks the same as
    for Intel.
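
    An abbreviated sketch of the valid-bits helper added by this commit; the
    real function checks each host/guest CPUID flag pair, including the SSBD
    variants elided here:

        u64 kvm_spec_ctrl_valid_bits(struct kvm_vcpu *vcpu)
        {
                u64 bits = SPEC_CTRL_IBRS | SPEC_CTRL_STIBP | SPEC_CTRL_SSBD;

                if (!boot_cpu_has(X86_FEATURE_SPEC_CTRL) &&
                    !boot_cpu_has(X86_FEATURE_AMD_IBRS))
                        bits &= ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP);
                if (!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) &&
                    !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBRS))
                        bits &= ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP);

                /* ... likewise for SPEC_CTRL_SSBD ... */

                return bits;
        }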

    Cc: Jim Mattson
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

21 Jan, 2020

2 commits

  • Move bit() to cpuid.h in preparation for incorporating the reverse_cpuid
    array in bit() build-time assertions. Opportunistically use the BIT()
    macro instead of open-coding the shift.

    No functional change intended.
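
    The moved helper then reads:

        static __always_inline u32 bit(int bitno)
        {
                return BIT(bitno & 31);
        }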

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • In our product observation, ICR and TSCDEADLINE MSR writes cause the
    majority of MSR-write vmexits; multicast IPIs are not as common as
    unicast IPIs such as RESCHEDULE_VECTOR and CALL_FUNCTION_SINGLE_VECTOR.

    This patch introduces a mechanism to handle certain performance-critical
    WRMSRs at a very early stage of the KVM VMExit handler.

    The mechanism is specifically used to accelerate writes to the x2APIC ICR
    that attempt to send a virtual IPI with physical destination mode, fixed
    delivery mode and a single target, which was found to be one of the main
    causes of VMExits for Linux workloads.

    The mechanism significantly reduces the latency of such virtual IPIs by
    sending the physical IPI to the target vCPU at a very early stage of the
    KVM VMExit handler, before host interrupts are enabled and before
    expensive operations such as reacquiring KVM's SRCU lock.
    Latency is reduced even further when KVM can use the APICv
    posted-interrupt mechanism (which delivers the virtual IPI directly to
    the target vCPU without needing to kick it to the host).

    Testing on a Xeon Skylake server:

    The virtual IPI latency from send to receive is reduced by more than
    200 CPU cycles.
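
    A sketch of the fastpath entry point, close to the version in this
    patch:

        static int handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu)
        {
                u32 msr = kvm_rcx_read(vcpu);
                u64 data = kvm_read_edx_eax(vcpu);

                /* Only x2APIC ICR writes for fixed-mode, single-target
                 * physical IPIs are handled here, with host IRQs still off. */
                if (msr == APIC_BASE_MSR + (APIC_ICR >> 4) &&
                    !handle_fastpath_set_x2apic_icr_irqoff(vcpu, data))
                        return 1;   /* handled; skip the heavyweight path */

                return 0;
        }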

    Reviewed-by: Liran Alon
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Sean Christopherson
    Cc: Vitaly Kuznetsov
    Cc: Liran Alon
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     

09 Jan, 2020

1 commit

  • Convert a plethora of parameters and variables in the MMU and page fault
    flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.

    Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
    addresses. When TDP is enabled, the fault address is a guest physical
    address and thus can be a 64-bit value, even when both KVM and its guest
    are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
    64-bit field, not a natural width field.

    Using a gva_t for the fault address means KVM will incorrectly drop the
    upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to
    translate L2 GPAs to L1 GPAs.

    Opportunistically rename variables and parameters to better reflect the
    dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
    "addr" instead of "vaddr" when the address may be either a GVA or an L2
    GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
    a confusing "gpa_t gva" declaration; this also sets the stage for a
    future patch to combine nonpaging_page_fault() and tdp_page_fault() with
    minimal churn.

    Sprinkle in a few comments to document flows where an address is known
    to be a GVA and thus can be safely truncated to a 32-bit value. Add
    WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
    document such cases and detect bugs.
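
    One of the WARNs, roughly as added by the patch:

        int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
                                  u64 fault_address, char *insn, int insn_len)
        {
        #ifndef CONFIG_X86_64
                /* A 64-bit CR2 should be impossible on 32-bit KVM. */
                if (WARN_ON_ONCE(fault_address >> 32))
                        return -EFAULT;
        #endif
                /* ... */
        }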

    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

15 Nov, 2019

1 commit

  • Commit 4b9852f4f389 ("KVM: x86: Fix INIT signal handling in various CPU states")
    fixed KVM to also latch pending LAPIC INIT event when vCPU is in VMX
    operation.

    However, current API of KVM_SET_MP_STATE allows userspace to put vCPU
    into KVM_MP_STATE_SIPI_RECEIVED or KVM_MP_STATE_INIT_RECEIVED even when
    vCPU is in VMX operation.

    Fix this by introducing a util method to check whether the vCPU state
    latches INIT signals, and use it in the KVM_SET_MP_STATE handler.
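
    A sketch of the util method and its use in the handler, close to the
    actual change:

        static inline bool kvm_vcpu_latch_init(struct kvm_vcpu *vcpu)
        {
                return is_smm(vcpu) || kvm_x86_ops->apic_init_signal_blocked(vcpu);
        }

        /* in kvm_arch_vcpu_ioctl_set_mpstate(): */
        if ((kvm_vcpu_latch_init(vcpu) || vcpu->arch.smi_pending) &&
            (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED ||
             mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
                goto out;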

    Fixes: 4b9852f4f389 ("KVM: x86: Fix INIT signal handling in various CPU states")
    Reported-by: Sean Christopherson
    Reviewed-by: Mihai Carabas
    Signed-off-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Liran Alon
     

22 Oct, 2019

2 commits

  • Hoist the vendor-specific code related to loading the hardware IA32_XSS
    MSR with guest/host values on VM-entry/VM-exit to common x86 code.
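
    A minimal sketch of the hoisted logic; exact guards and placement in
    common code may differ:

        /* on the way into the guest: */
        if (vcpu->arch.ia32_xss != host_xss)
                wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);

        /* ...and the mirror image on the way back to the host: */
        if (vcpu->arch.ia32_xss != host_xss)
                wrmsrl(MSR_IA32_XSS, host_xss);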

    Reviewed-by: Jim Mattson
    Signed-off-by: Aaron Lewis
    Change-Id: Ic6e3430833955b98eb9b79ae6715cf2a3fdd6d82
    Signed-off-by: Paolo Bonzini

    Aaron Lewis
     
  • Add WARN_ON_ONCE() checks in kvm_register_{read,write}() to detect reg
    values that would cause KVM to overflow vcpu->arch.regs. Change the reg
    param to an 'int' to make it clear that the reg index is unverified.

    Regarding the overhead of WARN_ON_ONCE(), now that all fixed GPR reads
    and writes use dedicated accessors, e.g. kvm_rax_read(), the overhead
    is limited to flows where the reg index is generated at runtime. And
    there is at least one historical bug where KVM has generated an out-of-
    bounds access to arch.regs (see commit b68f3cc7d9789, "KVM: x86: Always
    use 32-bit SMRAM save state for 32-bit kernels").

    Adding the WARN_ON_ONCE() protection paves the way for additional
    cleanup related to kvm_reg and kvm_reg_ex.
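
    A sketch of the hardened read accessor (the write side mirrors it):

        static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu, int reg)
        {
                if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
                        return 0;

                if (!test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail))
                        kvm_x86_ops->cache_reg(vcpu, reg);

                return vcpu->arch.regs[reg];
        }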

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

24 Sep, 2019

1 commit

  • Request triple fault in kvm_inject_realmode_interrupt() instead of
    returning EMULATE_FAIL and deferring to the caller. All existing
    callers request triple fault and it's highly unlikely Real Mode is
    going to acquire new features. While this consolidates a small amount
    of code, the real goal is to remove the last reference to EMULATE_FAIL.

    No functional change intended.
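
    A sketch of the consolidated flow:

        void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip)
        {
                struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;

                /* ... prepare the emulation context ... */

                if (emulate_int_real(ctxt, irq) != X86EMUL_CONTINUE) {
                        /* Request triple fault directly rather than
                         * returning EMULATE_FAIL to the caller. */
                        kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
                        return;
                }

                /* ... commit RIP and RFLAGS on success ... */
        }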

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

20 Jul, 2019

1 commit

  • Dedicated instances are currently disturbed by unnecessary jitter due
    to the emulated lapic timers firing on the same pCPUs where the vCPUs
    reside. Unlike ARM, Intel has no hardware virtual timer for the guest,
    so both programming the timer in the guest and the firing of the
    emulated timer incur vmexits. This patch tries to avoid the vmexit when
    the emulated timer fires, at least in the dedicated-instance scenario
    when nohz_full is enabled.

    In that case, the emulated timers can be offloaded to the nearest busy
    housekeeping CPUs, since APICv has been present in server processors
    for several years. The guest timer interrupt can then be injected via
    posted interrupts, which are delivered by the housekeeping CPU once the
    emulated timer fires.

    The host should be tuned so that vCPUs are placed on isolated physical
    processors, with several surplus pCPUs for busy housekeeping. If
    disabling mwait/hlt/pause vmexits keeps the vCPUs in non-root mode, a
    ~3% redis performance benefit can be observed on a Skylake server, and
    the number of external-interrupt vmexits drops substantially. Without
    the patch:

    VM-EXIT             Samples  Samples%  Time%  Min Time  Max Time  Avg time
    EXTERNAL_INTERRUPT    42916    49.43%  39.30%    0.47us  106.09us  0.71us ( +- 1.09% )

    While with the patch:

    VM-EXIT             Samples  Samples%  Time%  Min Time  Max Time  Avg time
    EXTERNAL_INTERRUPT     6871     9.29%   2.96%    0.44us   57.88us  0.72us ( +- 4.02% )
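
    A sketch of the gating helpers from the series; pi_inject_timer is the
    module parameter that enables the behavior:

        static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu)
        {
                return pi_inject_timer && kvm_vcpu_apicv_active(vcpu);
        }

        static bool kvm_use_posted_timer_interrupt(struct kvm_vcpu *vcpu)
        {
                return kvm_can_post_timer_interrupt(vcpu) &&
                       vcpu->mode == IN_GUEST_MODE;
        }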

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Marcelo Tosatti
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     

18 May, 2019

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - support for SVE and Pointer Authentication in guests
    - PMU improvements

    POWER:
    - support for direct access to the POWER9 XIVE interrupt controller
    - memory and performance optimizations

    x86:
    - support for accessing memory not backed by struct page
    - fixes and refactoring

    Generic:
    - dirty page tracking improvements"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
    kvm: fix compilation on aarch64
    Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
    kvm: x86: Fix L1TF mitigation for shadow MMU
    KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
    KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
    KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
    KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
    kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
    tests: kvm: Add tests for KVM_SET_NESTED_STATE
    KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
    tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
    tests: kvm: Add tests to .gitignore
    KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
    KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
    KVM: Fix the bitmap range to copy during clear dirty
    KVM: arm64: Fix ptrauth ID register masking logic
    KVM: x86: use direct accessors for RIP and RSP
    KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
    KVM: x86: Omit caching logic for always-available GPRs
    kvm, x86: Properly check whether a pfn is an MMIO or not
    ...

    Linus Torvalds
     

19 Apr, 2019

1 commit

  • Automatically adjusting the globally-shared timer advancement could
    corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
    the advancement value. That could be partially fixed by using a local
    variable for the arithmetic, but it would still be susceptible to a
    race when setting timer_advance_adjust_done.

    And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
    correct calibration for a given vCPU may not apply to all vCPUs.

    Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
    effectively violated when finding a stable advancement takes an extended
    amount of time.

    Opportunistically change the definition of lapic_timer_advance_ns to
    a u32 so that it matches the style of struct kvm_timer. Explicitly
    pass the param to kvm_create_lapic() so that it doesn't have to be
    exposed to lapic.c, thus reducing the probability of unintentionally
    using the global value instead of the per-vCPU value.
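
    A sketch of the shape of the change; surrounding code is elided:

        struct kvm_timer {
                /* ... */
                u32 timer_advance_ns;
                bool timer_advance_adjust_done;
        };

        /* The global module param becomes only a starting value, passed in
         * explicitly so lapic.c never reads it directly: */
        int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns);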

    Cc: Liran Alon
    Cc: Wanpeng Li
    Reviewed-by: Liran Alon
    Cc: stable@vger.kernel.org
    Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

16 Apr, 2019

2 commits

  • This check will soon be done on every nested vmentry and vmexit,
    "parallelize" it using bitwise operations.

    Reviewed-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Guest xcr0 could leak into the host when an MCE happens in guest mode,
    because do_machine_check() could schedule out at a few places.

    For example:

    kvm_load_guest_xcr0
    ...
    kvm_x86_ops->run(vcpu) {
      vmx_vcpu_run
        vmx_complete_atomic_exit
          kvm_machine_check
            do_machine_check
              do_memory_failure
                memory_failure
                  lock_page

    In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
    out, host cpu has guest xcr0 loaded (0xff).

    In __switch_to {
      switch_fpu_finish
        copy_kernel_to_fpregs
          XRSTORS

    If XSTATE_BV[i] == 1 and xcr0[i] == 0 for any bit i, XRSTORS will
    generate a #GP (in this case, bit 9). Then ex_handler_fprestore kicks in
    and tries to reinitialize the FPU by restoring the init FPU state. Same
    story as the last #GP, except we get a DOUBLE FAULT this time.
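
    A sketch of the fix: swap xcr0 inside the IRQs-disabled vendor run path,
    so a machine check handled on the way out of the guest can no longer
    schedule with the guest's xcr0 still loaded (placement approximate):

        static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
        {
                /* IRQs are disabled here */
                kvm_load_guest_xcr0(vcpu);

                /* ... enter guest; atomic exit handling, including
                 * machine checks, runs before IRQs are re-enabled ... */

                kvm_put_guest_xcr0(vcpu);
        }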

    Cc: stable@vger.kernel.org
    Signed-off-by: WANG Chao
    Signed-off-by: Paolo Bonzini

    WANG Chao