08 Nov, 2020

1 commit

  • The recent introduction of userspace MSR filtering added code that uses
    negative error codes for cases that result either in #GP delivery to
    the guest or in handling by the userspace MSR filter.

    This breaks the assumption that a negative error code returned from the
    MSR emulation code is a semi-fatal error which should be returned to
    userspace via the KVM_RUN ioctl and which usually kills the guest.

    Fix this by reusing the existing KVM_MSR_RET_INVALID error code and by
    adding a new KVM_MSR_RET_FILTERED error code for the userspace-filtered
    MSRs.
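
    As a rough sketch of the distinction (the constant values match x86.h;
    the kvm_msr_allowed()-based wrapper below is illustrative, not the
    exact upstream diff):

        #define KVM_MSR_RET_INVALID  2  /* in-kernel MSR emulation #GP condition */
        #define KVM_MSR_RET_FILTERED 3  /* #GP due to userspace MSR filtering */

        static int kvm_set_msr_filtered(struct kvm_vcpu *vcpu, u32 index, u64 data)
        {
                if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE))
                        return KVM_MSR_RET_FILTERED; /* not a semi-fatal negative code */

                return __kvm_set_msr(vcpu, index, data, false);
        }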

    Fixes: 291f35fb2c1d1 ("KVM: x86: report negative values from wrmsr emulation to userspace")
    Reported-by: Qian Cai
    Signed-off-by: Maxim Levitsky
    Signed-off-by: Paolo Bonzini

    Maxim Levitsky
     

09 Jul, 2020

6 commits

  • To avoid complex and in some cases incorrect logic in
    kvm_spec_ctrl_test_value, just try the guest's given value on the host
    processor instead, and if it doesn't #GP, allow the guest to set it.

    One such case is when the host CPU supports the STIBP mitigation but
    doesn't support IBRS (as is the case with some AMD Zen 2 CPUs); in this
    case the guest was given a #GP when it tried to use STIBP.

    The reason we can do the host test is that the IA32_SPEC_CTRL MSR is
    passed through to the guest after the guest sets it to a non-zero value
    for the first time (for performance reasons), so it is pointless to
    emulate the #GP condition on this first access differently from what
    the host CPU does.

    This is based on a patch from Sean Christopherson, who suggested this idea.
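
    A minimal sketch of the host-test approach (close to the final
    kvm_spec_ctrl_test_value(), though details may differ):

        int kvm_spec_ctrl_test_value(u64 value)
        {
                u64 saved_value;
                unsigned long flags;
                int ret = 0;

                local_irq_save(flags);

                /* A failed write means the host would #GP on this value. */
                if (rdmsrl_safe(MSR_IA32_SPEC_CTRL, &saved_value))
                        ret = 1;
                else if (wrmsrl_safe(MSR_IA32_SPEC_CTRL, value))
                        ret = 1;
                else
                        wrmsrl(MSR_IA32_SPEC_CTRL, saved_value);

                local_irq_restore(flags);

                return ret;
        }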

    Fixes: 6441fa6178f5 ("KVM: x86: avoid incorrect writes to host MSR_IA32_SPEC_CTRL")
    Cc: stable@vger.kernel.org
    Suggested-by: Sean Christopherson
    Signed-off-by: Maxim Levitsky
    Signed-off-by: Paolo Bonzini

    Maxim Levitsky
     
  • According to section "Canonicalization and Consistency Checks" in APM vol. 2
    the following guest state is illegal:

    "Any MBZ bit of CR3 is set."
    "Any MBZ bit of CR4 is set."

    Suggested-by: Paolo Bonzini
    Signed-off-by: Krish Sadhukhan
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • CR4.VMXE is reserved unless the VMX CPUID bit is set. On Intel,
    it is also tested by vmx_set_cr4, but AMD relies on kvm_valid_cr4,
    so fix it.
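
    A sketch of the check, assuming it lives in the common kvm_valid_cr4()
    path so that both Intel and AMD inherit it:

        static int kvm_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
        {
                if (cr4 & vcpu->arch.cr4_guest_rsvd_bits)
                        return -EINVAL;

                /* VMXE is reserved unless CPUID exposes VMX to the guest. */
                if (!guest_cpuid_has(vcpu, X86_FEATURE_VMX) && (cr4 & X86_CR4_VMXE))
                        return -EINVAL;

                return 0;
        }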

    Reviewed-by: Jim Mattson
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Instead of creating the mask for guest CR4 reserved bits in kvm_valid_cr4(),
    do it in kvm_update_cpuid() so that it can be reused instead of creating it
    each time kvm_valid_cr4() is called.
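
    A sketch of the precomputation; the helper is paraphrased, and the full
    list of CPUID-gated CR4 bits is abbreviated:

        static u64 kvm_calc_cr4_reserved_bits(struct kvm_vcpu *vcpu)
        {
                u64 reserved = CR4_RESERVED_BITS;

                if (!guest_cpuid_has(vcpu, X86_FEATURE_SMEP))
                        reserved |= X86_CR4_SMEP;
                if (!guest_cpuid_has(vcpu, X86_FEATURE_SMAP))
                        reserved |= X86_CR4_SMAP;
                if (!guest_cpuid_has(vcpu, X86_FEATURE_PKU))
                        reserved |= X86_CR4_PKE;
                /* ... likewise for the remaining CPUID-gated CR4 bits ... */

                return reserved;
        }

        /* in kvm_update_cpuid(): */
        vcpu->arch.cr4_guest_rsvd_bits = kvm_calc_cr4_reserved_bits(vcpu);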

    Suggested-by: Paolo Bonzini
    Signed-off-by: Krish Sadhukhan
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • Signed-off-by: Krish Sadhukhan
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • MSR accesses can be one of:

    (1) KVM internal access,
    (2) userspace access (e.g., via KVM_SET_MSRS ioctl),
    (3) guest access.

    ignore_msrs was previously handled by kvm_get_msr_common() and
    kvm_set_msr_common(), which sit at the bottom of the MSR access stack.
    That works in most cases, but it could dump unwanted warning messages
    to dmesg even when KVM gets/sets MSRs internally via __kvm_set_msr() or
    __kvm_get_msr() (e.g. from kvm_cpuid()). Ideally we only want to trap
    cases (2) and (3) above, not (1).

    To achieve this, move the ignore_msrs handling up into the callers of
    __kvm_get_msr() and __kvm_set_msr(). To identify the "MSR missing"
    event, a new return value (KVM_MSR_RET_INVALID==2) is used.
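
    A sketch of the resulting wrapper that only the userspace and guest
    entry points go through (close to the kvm_msr_ignored_check() shape,
    simplified here):

        static int kvm_msr_ignored_check(struct kvm_vcpu *vcpu, u32 msr,
                                         u64 data, bool write)
        {
                if (ignore_msrs) {
                        if (report_ignored_msrs)
                                vcpu_unimpl(vcpu, "ignored %s: 0x%x data 0x%llx\n",
                                            write ? "wrmsr" : "rdmsr", msr, data);
                        /* Mask the error: pretend the access succeeded. */
                        return 0;
                }

                return KVM_MSR_RET_INVALID;
        }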

    Signed-off-by: Peter Xu
    Signed-off-by: Paolo Bonzini

    Peter Xu
     

16 May, 2020

2 commits

  • Add a fastpath_t typedef, since the enum lines are a bit long, and
    replace EXIT_FASTPATH_SKIP_EMUL_INS with two new exit_fastpath_completion
    enum values:

    - EXIT_FASTPATH_EXIT_HANDLED: KVM still goes through its full run loop,
      but skips invoking the exit handler.

    - EXIT_FASTPATH_REENTER_GUEST: complete fastpath; the guest can be
      re-entered without invoking the exit handler or going back to
      vcpu_run.
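
    The resulting type, roughly as it reads after this change:

        enum exit_fastpath_completion {
                EXIT_FASTPATH_NONE,
                EXIT_FASTPATH_REENTER_GUEST,
                EXIT_FASTPATH_EXIT_HANDLED,
        };
        typedef enum exit_fastpath_completion fastpath_t;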

    Tested-by: Haiwei Li
    Cc: Haiwei Li
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     
  • Introduce the kvm_vcpu_exit_request() helper: we need to check some
    conditions before re-entering the guest immediately. If the fastpath
    completes but something prevents immediate re-entry, we skip invoking
    the exit handler and go through the full run loop instead.
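
    A sketch of the helper; upstream's version checks essentially these
    conditions:

        static inline bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu)
        {
                return vcpu->mode == EXITING_GUEST_MODE ||
                       kvm_request_pending(vcpu) ||
                       need_resched() ||
                       signal_pending(current);
        }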

    Tested-by: Haiwei Li
    Cc: Haiwei Li
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     

21 Apr, 2020

1 commit

  • Add KVM_REQ_TLB_FLUSH_CURRENT to allow optimized TLB flushing of VMX's
    EPTP/VPID contexts[*] from the KVM MMU and/or in a deferred manner, e.g.
    to flush L2's context during nested VM-Enter.

    Convert KVM_REQ_TLB_FLUSH to KVM_REQ_TLB_FLUSH_CURRENT in flows where
    the flush is directly associated with vCPU-scoped instruction emulation,
    i.e. MOV CR3 and INVPCID.

    Add a comment in vmx_vcpu_load_vmcs() above its KVM_REQ_TLB_FLUSH to
    make it clear that it deliberately requests a flush of all contexts.

    Service any pending flush request on nested VM-Exit as it's possible a
    nested VM-Exit could occur after requesting a flush for L2. Add the
    same logic for nested VM-Enter even though it's _extremely_ unlikely
    for flush to be pending on nested VM-Enter, but theoretically possible
    (in the future) due to RSM (SMM) emulation.

    [*] Intel also has an Address Space Identifier (ASID) concept, e.g.
    EPTP+VPID+PCID == ASID, it's just not documented in the SDM because
    the rules of invalidation are different based on which piece of the
    ASID is being changed, i.e. whether the EPTP, VPID, or PCID context
    must be invalidated.
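
    A rough sketch of how the two requests are then serviced in
    vcpu_enter_guest() (helper names follow the series):

        if (kvm_check_request(KVM_REQ_TLB_FLUSH, vcpu))
                kvm_vcpu_flush_tlb_all(vcpu);      /* all contexts */
        if (kvm_check_request(KVM_REQ_TLB_FLUSH_CURRENT, vcpu))
                kvm_vcpu_flush_tlb_current(vcpu);  /* current EPTP/VPID context only */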

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

31 Mar, 2020

1 commit

  • Replace the kvm_x86_ops pointer in common x86 with an instance of the
    struct to save one pointer dereference when invoking functions. Copy the
    struct by value to set the ops during kvm_init().

    Arbitrarily use kvm_x86_ops.hardware_enable to track whether or not the
    ops have been initialized, i.e. a vendor KVM module has been loaded.
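
    A sketch of the shape of the change; the init-ops field name is
    approximate:

        struct kvm_x86_ops kvm_x86_ops __read_mostly;
        EXPORT_SYMBOL_GPL(kvm_x86_ops);

        /* during kvm_arch_hardware_setup(), sketched: */
        kvm_x86_ops = *ops->runtime_ops;   /* copy by value: one less dereference */

        /* "has a vendor module been loaded?" then reads: */
        if (kvm_x86_ops.hardware_enable) {
                /* ops are initialized */
        }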

    Suggested-by: Paolo Bonzini
    Signed-off-by: Sean Christopherson
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

17 Mar, 2020

5 commits

  • Current CPUID 0xd enumeration code does not support supervisor
    states, because KVM only supports setting IA32_XSS to zero.
    Change it instead to use a new variable supported_xss, to be
    set from the hardware_setup callback which is in charge of CPU
    capabilities.
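
    A minimal sketch, assuming the variable lives in common x86 code and
    vendor code clears it while no supervisor states are supported:

        u64 __read_mostly supported_xss;
        EXPORT_SYMBOL_GPL(supported_xss);

        /* in the vendor hardware_setup(), sketched: */
        supported_xss = 0;   /* KVM supports no supervisor states yet */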

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Expose kvm_mpx_supported() as a static inline so that it can be inlined
    in kvm_intel.ko.

    No functional change intended.
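
    The helper, roughly as it reads once it is inlined into x86.h:

        static inline bool kvm_mpx_supported(void)
        {
                return (supported_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR))
                        == (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
        }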

    Reviewed-by: Xiaoyao Li
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Add a new global variable, supported_xcr0, to track which xcr0 bits can
    be exposed to the guest instead of calculating the mask on every call.
    The supported bits are constant for a given instance of KVM.

    This paves the way toward eliminating the ->mpx_supported() call in
    kvm_mpx_supported(), e.g. eliminates multiple retpolines in VMX's nested
    VM-Enter path, and eventually toward eliminating ->mpx_supported()
    altogether.

    No functional change intended.
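
    A sketch of the setup, computed once during hardware setup (placement
    approximate):

        u64 __read_mostly supported_xcr0;
        EXPORT_SYMBOL_GPL(supported_xcr0);

        if (boot_cpu_has(X86_FEATURE_XSAVE)) {
                host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
                supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
        }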

    Reviewed-by: Xiaoyao Li
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Now that the emulation context is dynamically allocated and not embedded
    in struct kvm_vcpu, move its header, kvm_emulate.h, out of the public
    asm directory and into KVM's private x86 directory.

    Signed-off-by: Sean Christopherson
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Move ctxt_virt_addr_bits() and emul_is_noncanonical_address() from x86.h
    to emulate.c. This eliminates all references to struct x86_emulate_ctxt
    from x86.h, and sets the stage for a future patch to stop including
    kvm_emulate.h in asm/kvm_host.h.
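
    The two helpers, roughly as they land in emulate.c:

        static inline u8 ctxt_virt_addr_bits(struct x86_emulate_ctxt *ctxt)
        {
                return (ctxt->ops->get_cr(ctxt, 4) & X86_CR4_LA57) ? 57 : 48;
        }

        static inline bool emul_is_noncanonical_address(u64 la,
                                                        struct x86_emulate_ctxt *ctxt)
        {
                return get_canonical(la, ctxt_virt_addr_bits(ctxt)) != la;
        }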

    Signed-off-by: Sean Christopherson
    Reviewed-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

05 Feb, 2020

1 commit

  • Take a u64 instead of an unsigned long in kvm_dr7_valid() to fix a build
    warning on i386 due to right-shifting a 32-bit value by 32 when checking
    for bits being set in dr7[63:32].

    Alternatively, the warning could be resolved by rewriting the check to
    use an i386-friendly method, but taking a u64 fixes another oddity on
    32-bit KVM. Because KVM implements natural width VMCS fields as u64s to
    avoid layout issues between 32-bit and 64-bit, a devious guest can stuff
    vmcs12->guest_dr7 with a 64-bit value even when both the guest and host
    are 32-bit kernels. KVM eventually drops vmcs12->guest_dr7[63:32] when
    propagating vmcs12->guest_dr7 to vmcs02, but ideally KVM would not rely
    on that behavior for correctness.
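
    The resulting helper in x86.h:

        static inline bool kvm_dr7_valid(u64 data)
        {
                /* Bits 63:32 are reserved */
                return !(data >> 32);
        }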

    Cc: Jim Mattson
    Cc: Krish Sadhukhan
    Fixes: ecb697d10f70 ("KVM: nVMX: Check GUEST_DR7 on vmentry of nested guests")
    Reported-by: Randy Dunlap
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

28 Jan, 2020

2 commits

  • According to section "Checks on Guest Control Registers, Debug Registers,
    and MSRs" in Intel SDM vol 3C, the following checks are performed on vmentry
    of nested guests:

    If the "load debug controls" VM-entry control is 1, bits 63:32 in the DR7
    field must be 0.

    In KVM, GUEST_DR7 is set prior to the vmcs02 VM-entry by kvm_set_dr() and the
    latter synthesizes a #GP if any bit in the high dword in the former is set.
    Hence this field needs to be checked in software.
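
    A sketch of the added consistency check in nested_vmx_check_guest_state(),
    written here in terms of the kvm_dr7_valid() helper from the later commit
    above:

        if ((vmcs12->vm_entry_controls & VM_ENTRY_LOAD_DEBUG_CONTROLS) &&
            !kvm_dr7_valid(vmcs12->guest_dr7))
                return -EINVAL;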

    Signed-off-by: Krish Sadhukhan
    Reviewed-by: Karl Heubaum
    Signed-off-by: Paolo Bonzini

    Krish Sadhukhan
     
  • Remove the CONFIG_X86_64 condition from the low level non-canonical
    helpers to effectively enable non-canonical checks on 32-bit KVM.
    Non-canonical checks are performed by hardware if the CPU *supports*
    64-bit mode, whether or not the CPU is actually in 64-bit mode is
    irrelevant.

    For the most part, skipping non-canonical checks on 32-bit KVM is ok-ish
    because 32-bit KVM always (hopefully) drops bits 63:32 of whatever value
    it's checking before propagating it to hardware, and architecturally,
    the expected behavior for the guest is a bit of a grey area since the
    vCPU itself doesn't support 64-bit mode. I.e. a 32-bit KVM guest can
    observe the missed checks in several paths, e.g. INVVPID and VM-Enter,
    but it's debatable whether or not the missed checks constitute a bug
    because technically the vCPU doesn't support 64-bit mode.

    The primary motivation for enabling the non-canonical checks is defense
    in depth. As mentioned above, a guest can trigger a missed check via
    INVVPID or VM-Enter. INVVPID is straightforward as it takes a 64-bit
    virtual address as part of its 128-bit INVVPID descriptor and fails if
    the address is non-canonical, even if INVVPID is executed in 32-bit PM.
    Nested VM-Enter is a bit more convoluted as it requires the guest to
    write natural width VMCS fields via memory accesses and then VMPTRLD the
    VMCS, but it's still possible. In both cases, KVM is saved from a true
    bug only because its flows that propagate values to hardware (correctly)
    take "unsigned long" parameters and so drop bits 63:32 of the bad value.

    Explicitly performing the non-canonical checks makes it less likely that
    a bad value will be propagated to hardware, e.g. in the INVVPID case,
    if __invvpid() didn't implicitly drop bits 63:32 then KVM would BUG() on
    the resulting unexpected INVVPID failure due to hardware rejecting the
    non-canonical address.

    The only downside to enabling the non-canonical checks is that it adds a
    relatively small amount of overhead, but the affected flows are not hot
    paths, i.e. the overhead is negligible.

    Note, KVM technically could gate the non-canonical checks on 32-bit KVM
    with static_cpu_has(X86_FEATURE_LM), but on bare metal that's an even
    bigger waste of code for everyone except the 0.00000000000001% of the
    population running on Yonah, and nested 32-bit on 64-bit already fudges
    things with respect to 64-bit CPU behavior.
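
    The low level helper, with the CONFIG_X86_64 guard removed (the #else
    arm used to return false on 32-bit builds):

        static inline bool is_noncanonical_address(u64 la, struct kvm_vcpu *vcpu)
        {
                return get_canonical(la, vcpu_virt_addr_bits(vcpu)) != la;
        }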

    Signed-off-by: Sean Christopherson
    [Also do so in nested_vmx_check_host_state as reported by Krish. - Paolo]
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

24 Jan, 2020

1 commit

  • If the guest is configured to have SPEC_CTRL but the host does not
    (which is a nonsensical configuration but these are not explicitly
    forbidden) then a host-initiated MSR write can write vmx->spec_ctrl
    (respectively svm->spec_ctrl) and trigger a #GP when KVM tries to
    restore the host value of the MSR. Add a more comprehensive check
    for valid bits of SPEC_CTRL, covering host CPUID flags and,
    since we are at it and it is more correct that way, guest CPUID
    flags too.

    For AMD, remove the unnecessary is_guest_mode check around setting
    the MSR interception bitmap, so that the code looks the same as
    for Intel.
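
    An abbreviated sketch of the valid-bits helper added by this commit; the
    real function checks each host/guest CPUID flag pair, including the SSBD
    variants elided here:

        u64 kvm_spec_ctrl_valid_bits(struct kvm_vcpu *vcpu)
        {
                u64 bits = SPEC_CTRL_IBRS | SPEC_CTRL_STIBP | SPEC_CTRL_SSBD;

                if (!boot_cpu_has(X86_FEATURE_SPEC_CTRL) &&
                    !boot_cpu_has(X86_FEATURE_AMD_IBRS))
                        bits &= ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP);
                if (!guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) &&
                    !guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBRS))
                        bits &= ~(SPEC_CTRL_IBRS | SPEC_CTRL_STIBP);

                /* ... likewise for SPEC_CTRL_SSBD ... */

                return bits;
        }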

    Cc: Jim Mattson
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

21 Jan, 2020

2 commits

  • Move bit() to cpuid.h in preparation for incorporating the reverse_cpuid
    array in bit() build-time assertions. Opportunistically use the BIT()
    macro instead of open-coding the shift.

    No functional change intended.
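
    The moved helper then reads:

        static __always_inline u32 bit(int bitno)
        {
                return BIT(bitno & 31);
        }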

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • In our product observation, ICR and TSCDEADLINE MSR writes cause the
    majority of MSR-write vmexits; multicast IPIs are not as common as
    unicast IPIs such as RESCHEDULE_VECTOR and CALL_FUNCTION_SINGLE_VECTOR.

    This patch introduces a mechanism to handle certain performance-critical
    WRMSRs at a very early stage of the KVM VMExit handler.

    The mechanism is specifically used to accelerate writes to the x2APIC ICR
    that attempt to send a virtual IPI with physical destination mode, fixed
    delivery mode and a single target, which was found to be one of the main
    causes of VMExits for Linux workloads.

    The mechanism significantly reduces the latency of such virtual IPIs by
    sending the physical IPI to the target vCPU at a very early stage of the
    KVM VMExit handler, before host interrupts are enabled and before
    expensive operations such as reacquiring KVM's SRCU lock.
    Latency is reduced even further when KVM can use the APICv
    posted-interrupt mechanism (which delivers the virtual IPI directly to
    the target vCPU without needing to kick it to the host).

    Testing on a Xeon Skylake server:

    The virtual IPI latency from send to receive is reduced by more than
    200 CPU cycles.
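
    A sketch of the fastpath entry point, close to the version in this
    patch:

        static int handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu)
        {
                u32 msr = kvm_rcx_read(vcpu);
                u64 data = kvm_read_edx_eax(vcpu);

                /* Only x2APIC ICR writes for fixed-mode, single-target
                 * physical IPIs are handled here, with host IRQs still off. */
                if (msr == APIC_BASE_MSR + (APIC_ICR >> 4) &&
                    !handle_fastpath_set_x2apic_icr_irqoff(vcpu, data))
                        return 1;   /* handled; skip the heavyweight path */

                return 0;
        }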

    Reviewed-by: Liran Alon
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Sean Christopherson
    Cc: Vitaly Kuznetsov
    Cc: Liran Alon
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     

09 Jan, 2020

1 commit

  • Convert a plethora of parameters and variables in the MMU and page fault
    flows from type gva_t to gpa_t to properly handle TDP on 32-bit KVM.

    Thanks to PSE and PAE paging, 32-bit kernels can access 64-bit physical
    addresses. When TDP is enabled, the fault address is a guest physical
    address and thus can be a 64-bit value, even when both KVM and its guest
    are using 32-bit virtual addressing, e.g. VMX's VMCS.GUEST_PHYSICAL is a
    64-bit field, not a natural width field.

    Using a gva_t for the fault address means KVM will incorrectly drop the
    upper 32-bits of the GPA. Ditto for gva_to_gpa() when it is used to
    translate L2 GPAs to L1 GPAs.

    Opportunistically rename variables and parameters to better reflect the
    dual address modes, e.g. use "cr2_or_gpa" for fault addresses and plain
    "addr" instead of "vaddr" when the address may be either a GVA or an L2
    GPA. Similarly, use "gpa" in the nonpaging_page_fault() flows to avoid
    a confusing "gpa_t gva" declaration; this also sets the stage for a
    future patch to combine nonpaging_page_fault() and tdp_page_fault() with
    minimal churn.

    Sprinkle in a few comments to document flows where an address is known
    to be a GVA and thus can be safely truncated to a 32-bit value. Add
    WARNs in kvm_handle_page_fault() and FNAME(gva_to_gpa_nested)() to help
    document such cases and detect bugs.
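
    One of the WARNs, roughly as added by the patch:

        int kvm_handle_page_fault(struct kvm_vcpu *vcpu, u64 error_code,
                                  u64 fault_address, char *insn, int insn_len)
        {
        #ifndef CONFIG_X86_64
                /* A 64-bit CR2 should be impossible on 32-bit KVM. */
                if (WARN_ON_ONCE(fault_address >> 32))
                        return -EFAULT;
        #endif
                /* ... */
        }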

    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

15 Nov, 2019

1 commit

  • Commit 4b9852f4f389 ("KVM: x86: Fix INIT signal handling in various CPU states")
    fixed KVM to also latch pending LAPIC INIT event when vCPU is in VMX
    operation.

    However, current API of KVM_SET_MP_STATE allows userspace to put vCPU
    into KVM_MP_STATE_SIPI_RECEIVED or KVM_MP_STATE_INIT_RECEIVED even when
    vCPU is in VMX operation.

    Fix this by introducing a util method to check whether the vCPU state
    latches INIT signals, and use it in the KVM_SET_MP_STATE handler.
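
    A sketch of the util method and its use in the handler, close to the
    actual change:

        static inline bool kvm_vcpu_latch_init(struct kvm_vcpu *vcpu)
        {
                return is_smm(vcpu) || kvm_x86_ops->apic_init_signal_blocked(vcpu);
        }

        /* in kvm_arch_vcpu_ioctl_set_mpstate(): */
        if ((kvm_vcpu_latch_init(vcpu) || vcpu->arch.smi_pending) &&
            (mp_state->mp_state == KVM_MP_STATE_SIPI_RECEIVED ||
             mp_state->mp_state == KVM_MP_STATE_INIT_RECEIVED))
                goto out;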

    Fixes: 4b9852f4f389 ("KVM: x86: Fix INIT signal handling in various CPU states")
    Reported-by: Sean Christopherson
    Reviewed-by: Mihai Carabas
    Signed-off-by: Liran Alon
    Signed-off-by: Paolo Bonzini

    Liran Alon
     

22 Oct, 2019

2 commits

  • Hoist the vendor-specific code related to loading the hardware IA32_XSS
    MSR with guest/host values on VM-entry/VM-exit to common x86 code.
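
    A minimal sketch of the hoisted logic; exact guards and placement in
    common code may differ:

        /* on the way into the guest: */
        if (vcpu->arch.ia32_xss != host_xss)
                wrmsrl(MSR_IA32_XSS, vcpu->arch.ia32_xss);

        /* ...and the mirror image on the way back to the host: */
        if (vcpu->arch.ia32_xss != host_xss)
                wrmsrl(MSR_IA32_XSS, host_xss);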

    Reviewed-by: Jim Mattson
    Signed-off-by: Aaron Lewis
    Change-Id: Ic6e3430833955b98eb9b79ae6715cf2a3fdd6d82
    Signed-off-by: Paolo Bonzini

    Aaron Lewis
     
  • Add WARN_ON_ONCE() checks in kvm_register_{read,write}() to detect reg
    values that would cause KVM to overflow vcpu->arch.regs. Change the reg
    param to an 'int' to make it clear that the reg index is unverified.

    Regarding the overhead of WARN_ON_ONCE(), now that all fixed GPR reads
    and writes use dedicated accessors, e.g. kvm_rax_read(), the overhead
    is limited to flows where the reg index is generated at runtime. And
    there is at least one historical bug where KVM has generated an out-of-
    bounds access to arch.regs (see commit b68f3cc7d9789, "KVM: x86: Always
    use 32-bit SMRAM save state for 32-bit kernels").

    Adding the WARN_ON_ONCE() protection paves the way for additional
    cleanup related to kvm_reg and kvm_reg_ex.
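
    A sketch of the hardened read accessor (the write side mirrors it):

        static inline unsigned long kvm_register_read(struct kvm_vcpu *vcpu, int reg)
        {
                if (WARN_ON_ONCE((unsigned int)reg >= NR_VCPU_REGS))
                        return 0;

                if (!test_bit(reg, (unsigned long *)&vcpu->arch.regs_avail))
                        kvm_x86_ops->cache_reg(vcpu, reg);

                return vcpu->arch.regs[reg];
        }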

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

24 Sep, 2019

1 commit

  • Request triple fault in kvm_inject_realmode_interrupt() instead of
    returning EMULATE_FAIL and deferring to the caller. All existing
    callers request triple fault and it's highly unlikely Real Mode is
    going to acquire new features. While this consolidates a small amount
    of code, the real goal is to remove the last reference to EMULATE_FAIL.

    No functional change intended.
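
    A sketch of the consolidated flow:

        void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip)
        {
                struct x86_emulate_ctxt *ctxt = &vcpu->arch.emulate_ctxt;

                /* ... prepare the emulation context ... */

                if (emulate_int_real(ctxt, irq) != X86EMUL_CONTINUE) {
                        /* Request triple fault directly rather than
                         * returning EMULATE_FAIL to the caller. */
                        kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
                        return;
                }

                /* ... commit RIP and RFLAGS on success ... */
        }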

    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

20 Jul, 2019

1 commit

  • Dedicated instances are currently disturbed by unnecessary jitter due
    to the emulated lapic timers firing on the same pCPUs where the vCPUs
    reside. Unlike ARM, Intel has no hardware virtual timer for the guest,
    so both programming the timer in the guest and the firing of the
    emulated timer incur vmexits. This patch tries to avoid the vmexit when
    the emulated timer fires, at least in the dedicated-instance scenario
    when nohz_full is enabled.

    In that case, the emulated timers can be offloaded to the nearest busy
    housekeeping CPUs, since APICv has been present in server processors
    for several years. The guest timer interrupt can then be injected via
    posted interrupts, which are delivered by the housekeeping CPU once the
    emulated timer fires.

    The host should be tuned so that vCPUs are placed on isolated physical
    processors, with several surplus pCPUs for busy housekeeping. If
    disabling mwait/hlt/pause vmexits keeps the vCPUs in non-root mode, a
    ~3% redis performance benefit can be observed on a Skylake server, and
    the number of external-interrupt vmexits drops substantially. Without
    the patch:

    VM-EXIT             Samples  Samples%  Time%  Min Time  Max Time  Avg time
    EXTERNAL_INTERRUPT    42916    49.43%  39.30%    0.47us  106.09us  0.71us ( +- 1.09% )

    While with the patch:

    VM-EXIT             Samples  Samples%  Time%  Min Time  Max Time  Avg time
    EXTERNAL_INTERRUPT     6871     9.29%   2.96%    0.44us   57.88us  0.72us ( +- 4.02% )
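
    A sketch of the gating helpers from the series; pi_inject_timer is the
    module parameter that enables the behavior:

        static bool kvm_can_post_timer_interrupt(struct kvm_vcpu *vcpu)
        {
                return pi_inject_timer && kvm_vcpu_apicv_active(vcpu);
        }

        static bool kvm_use_posted_timer_interrupt(struct kvm_vcpu *vcpu)
        {
                return kvm_can_post_timer_interrupt(vcpu) &&
                       vcpu->mode == IN_GUEST_MODE;
        }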

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Marcelo Tosatti
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     

18 May, 2019

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "ARM:
    - support for SVE and Pointer Authentication in guests
    - PMU improvements

    POWER:
    - support for direct access to the POWER9 XIVE interrupt controller
    - memory and performance optimizations

    x86:
    - support for accessing memory not backed by struct page
    - fixes and refactoring

    Generic:
    - dirty page tracking improvements"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits)
    kvm: fix compilation on aarch64
    Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU"
    kvm: x86: Fix L1TF mitigation for shadow MMU
    KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible
    KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device
    KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"
    KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs
    kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete
    tests: kvm: Add tests for KVM_SET_NESTED_STATE
    KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state
    tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID
    tests: kvm: Add tests to .gitignore
    KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2
    KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one
    KVM: Fix the bitmap range to copy during clear dirty
    KVM: arm64: Fix ptrauth ID register masking logic
    KVM: x86: use direct accessors for RIP and RSP
    KVM: VMX: Use accessors for GPRs outside of dedicated caching logic
    KVM: x86: Omit caching logic for always-available GPRs
    kvm, x86: Properly check whether a pfn is an MMIO or not
    ...

    Linus Torvalds
     

19 Apr, 2019

1 commit

  • Automatically adjusting the globally-shared timer advancement could
    corrupt the timer, e.g. if multiple vCPUs are concurrently adjusting
    the advancement value. That could be partially fixed by using a local
    variable for the arithmetic, but it would still be susceptible to a
    race when setting timer_advance_adjust_done.

    And because virtual_tsc_khz and tsc_scaling_ratio are per-vCPU, the
    correct calibration for a given vCPU may not apply to all vCPUs.

    Furthermore, lapic_timer_advance_ns is marked __read_mostly, which is
    effectively violated when finding a stable advancement takes an extended
    amount of time.

    Opportunistically change the definition of lapic_timer_advance_ns to
    a u32 so that it matches the style of struct kvm_timer. Explicitly
    pass the param to kvm_create_lapic() so that it doesn't have to be
    exposed to lapic.c, thus reducing the probability of unintentionally
    using the global value instead of the per-vCPU value.
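
    A sketch of the shape of the change; surrounding code is elided:

        struct kvm_timer {
                /* ... */
                u32 timer_advance_ns;
                bool timer_advance_adjust_done;
        };

        /* The global module param becomes only a starting value, passed in
         * explicitly so lapic.c never reads it directly: */
        int kvm_create_lapic(struct kvm_vcpu *vcpu, int timer_advance_ns);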

    Cc: Liran Alon
    Cc: Wanpeng Li
    Reviewed-by: Liran Alon
    Cc: stable@vger.kernel.org
    Fixes: 3b8a5df6c4dc6 ("KVM: LAPIC: Tune lapic_timer_advance_ns automatically")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     

16 Apr, 2019

2 commits

  • This check will soon be done on every nested vmentry and vmexit,
    "parallelize" it using bitwise operations.

    Reviewed-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Guest xcr0 could leak into the host when an MCE happens in guest mode,
    because do_machine_check() could schedule out at a few places.

    For example:

    kvm_load_guest_xcr0
    ...
    kvm_x86_ops->run(vcpu) {
      vmx_vcpu_run
        vmx_complete_atomic_exit
          kvm_machine_check
            do_machine_check
              do_memory_failure
                memory_failure
                  lock_page

    In this case, host_xcr0 is 0x2ff, guest vcpu xcr0 is 0xff. After schedule
    out, host cpu has guest xcr0 loaded (0xff).

    In __switch_to {
      switch_fpu_finish
        copy_kernel_to_fpregs
          XRSTORS

    If XSTATE_BV[i] == 1 and xcr0[i] == 0 for any bit i, XRSTORS will
    generate a #GP (in this case, bit 9). Then ex_handler_fprestore kicks in
    and tries to reinitialize the FPU by restoring the init FPU state. Same
    story as the last #GP, except we get a DOUBLE FAULT this time.
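
    A sketch of the fix: swap xcr0 inside the IRQs-disabled vendor run path,
    so a machine check handled on the way out of the guest can no longer
    schedule with the guest's xcr0 still loaded (placement approximate):

        static void vmx_vcpu_run(struct kvm_vcpu *vcpu)
        {
                /* IRQs are disabled here */
                kvm_load_guest_xcr0(vcpu);

                /* ... enter guest; atomic exit handling, including
                 * machine checks, runs before IRQs are re-enabled ... */

                kvm_put_guest_xcr0(vcpu);
        }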

    Cc: stable@vger.kernel.org
    Signed-off-by: WANG Chao
    Signed-off-by: Paolo Bonzini

    WANG Chao