13 Jan, 2021

4 commits

  • commit 2f80d502d627f30257ba7e3655e71c373b7d1a5a upstream.

    Since we know that e >= s, we can reassociate the left shift,
    changing the shifted number from 1 to 2 in exchange for
    decreasing the right hand side by 1.
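
    A minimal sketch of the transformation (the helper name rsvd_bits and
    its exact shape are assumptions inferred from this description):

        /* Before: undefined behavior when e = 63 and s = 0, because the
         * shift count (e - s + 1) reaches 64. */
        u64 rsvd_bits_old(int s, int e)
        {
                return ((1ULL << (e - s + 1)) - 1) << s;
        }

        /* After: 2 << (e - s) equals 1 << (e - s + 1) for e >= s, and
         * the shift count now tops out at 63. */
        u64 rsvd_bits_new(int s, int e)
        {
                return ((2ULL << (e - s)) - 1) << s;
        }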

    Reported-by: syzbot+e87846c48bf72bc85311@syzkaller.appspotmail.com
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     
  • commit a889ea54b3daa63ee1463dc19ed699407d61458b upstream.

    Many TDP MMU functions which need to perform some action on all TDP MMU
    roots hold a reference on that root so that they can safely drop the MMU
    lock in order to yield to other threads. However, when releasing the
    reference on the root, there is a bug: the root will not be freed even
    if its reference count (root_count) is reduced to 0.

    To simplify acquiring and releasing references on TDP MMU root pages, and
    to ensure that these roots are properly freed, move the get/put operations
    into another TDP MMU root iterator macro.

    Moving the get/put operations into an iterator macro also helps
    simplify control flow when a root does need to be freed. Note that using
    the list_for_each_entry_safe macro would not have been appropriate in
    this situation because it could keep a pointer to the next root across
    an MMU lock release + reacquire, during which time that root could be
    freed.
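
    A self-contained model of such an iterator (the names are
    illustrative, not the patch's): the next root is pinned before the
    reference on the previous one is dropped, and the put path is the
    single place where a zero refcount triggers freeing.

        struct root {
                struct root *next;
                int refcount;
        };

        static void get_root(struct root *r)
        {
                r->refcount++;
        }

        static void put_root(struct root *r)
        {
                if (--r->refcount == 0) {
                        /* free the root and its page table subtree here */
                }
        }

        /* Pin the next root before unpinning @prev, so the loop body may
         * drop and reacquire the MMU lock while its current root stays
         * pinned, and a root whose count hits zero is freed right here. */
        static struct root *next_root(struct root *head, struct root *prev)
        {
                struct root *next = prev ? prev->next : head;

                if (next)
                        get_root(next);
                if (prev)
                        put_root(prev);
                return next;
        }

        #define for_each_root_yield_safe(head, r) \
                for ((r) = next_root((head), NULL); (r); \
                     (r) = next_root((head), (r)))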

    Reported-by: Maciej S. Szmigiero
    Suggested-by: Paolo Bonzini
    Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
    Fixes: 063afacd8730 ("kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU")
    Fixes: a6a0b05da9f3 ("kvm: x86/mmu: Support dirty logging for the TDP MMU")
    Fixes: 14881998566d ("kvm: x86/mmu: Support disabling dirty logging for the tdp MMU")
    Signed-off-by: Ben Gardon
    Message-Id:
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Ben Gardon
     
  • commit 39b4d43e6003cee51cd119596d3c33d0449eb44c upstream.

    Get the so-called "root" level from the low-level shadow page table
    walkers instead of manually attempting to calculate it higher up the
    stack, e.g. in get_mmio_spte(). When KVM is using PAE shadow paging,
    the starting level of the walk, from the caller's perspective, is not
    the CR3 root but rather the PDPTR "root". Checking for reserved bits
    from the CR3 root causes get_mmio_spte() to consume uninitialized stack
    data due to indexing into sptes[] for a level that was not filled by
    get_walk(). This can result in false positives and/or negatives
    depending on what garbage happens to be on the stack.

    Opportunistically nuke a few extra newlines.
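
    A sketch of the resulting contract (signatures and types are
    assumptions based on KVM's MMU code): the walker reports both the
    level it started at and the leaf it reached, so the caller only ever
    indexes sptes[] within the range the walk actually filled.

        /* Fills sptes[leaf..root]; returns the leaf level, or -1 if the
         * walk filled nothing (see the next entry below). */
        int get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
                     int *root_level);

        bool mmio_spte_rsvd_bits_ok(struct kvm_vcpu *vcpu, u64 addr)
        {
                u64 sptes[PT64_ROOT_MAX_LEVEL + 1];
                int root, leaf, level;

                leaf = get_walk(vcpu, addr, sptes, &root);
                if (leaf < 0)
                        return false;
                for (level = root; level >= leaf; level--) {
                        /* check reserved bits in sptes[level] only */
                }
                return true;
        }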

    Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU")
    Reported-by: Richard Herbert
    Cc: Ben Gardon
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     
  • commit 2aa078932ff6c66bf10cc5b3144440dbfa7d813d upstream.

    Return -1 from the get_walk() helpers if the shadow walk doesn't fill at
    least one spte, which can theoretically happen if the walk hits a
    not-present PDPTR. Returning the root level in such a case will cause
    get_mmio_spte() to return garbage (uninitialized stack data). In
    practice, such a scenario should be impossible as KVM shouldn't get a
    reserved-bit page fault with a not-present PDPTR.

    Note, using mmu->root_level in get_walk() is wrong for other reasons,
    too, but that's now a moot point.

    Fixes: 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU")
    Cc: Ben Gardon
    Cc: stable@vger.kernel.org
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
     

30 Dec, 2020

2 commits

  • commit 9d4747d02376aeb8de38afa25430de79129c5799 upstream.

    When both KVM support and the CCP driver are built into the kernel instead
    of as modules, KVM initialization can happen before CCP initialization. As
    a result, sev_platform_status() will return a failure when it is called
    from sev_hardware_setup(), when this isn't really an error condition.

    Since sev_platform_status() doesn't need to be called at this time anyway,
    remove the invocation from sev_hardware_setup().

    Signed-off-by: Tom Lendacky
    Message-Id:
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Tom Lendacky
     
  • commit 39485ed95d6b83b62fa75c06c2c4d33992e0d971 upstream.

    Until commit e7c587da1252 ("x86/speculation: Use synthetic bits for
    IBRS/IBPB/STIBP"), KVM was testing both Intel and AMD CPUID bits before
    allowing the guest to write MSR_IA32_SPEC_CTRL and MSR_IA32_PRED_CMD.
    Testing only Intel bits on VMX processors, or only AMD bits on SVM
    processors, fails if the guests are created with the "opposite" vendor
    as the host.

    While at it, also tweak the host CPU check to use the vendor-agnostic
    feature bit X86_FEATURE_IBPB, since we only care about the availability
    of the MSR on the host here and not about specific CPUID bits.
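
    A sketch of the guest-side check (the helper style is modeled on
    KVM's guest_cpuid_has(); the exact list of feature bits is an
    assumption):

        static bool guest_has_spec_ctrl_msr(struct kvm_vcpu *vcpu)
        {
                return guest_cpuid_has(vcpu, X86_FEATURE_SPEC_CTRL) ||
                       guest_cpuid_has(vcpu, X86_FEATURE_AMD_IBRS) ||
                       guest_cpuid_has(vcpu, X86_FEATURE_AMD_STIBP) ||
                       guest_cpuid_has(vcpu, X86_FEATURE_AMD_SSBD);
        }

    On the host side, the gate reduces to the vendor-agnostic
    boot_cpu_has(X86_FEATURE_IBPB) check described above.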

    Fixes: e7c587da1252 ("x86/speculation: Use synthetic bits for IBRS/IBPB/STIBP")
    Cc: stable@vger.kernel.org
    Reported-by: Denis V. Lunev
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     

13 Dec, 2020

1 commit

  • Pull kvm fixes from Paolo Bonzini:
    "Bugfixes for ARM, x86 and tools"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    tools/kvm_stat: Exempt time-based counters
    KVM: mmu: Fix SPTE encoding of MMIO generation upper half
    kvm: x86/mmu: Use cpuid to determine max gfn
    kvm: svm: de-allocate svm_cpu_data for all cpus in svm_cpu_uninit()
    selftests: kvm/set_memory_region_test: Fix race in move region test
    KVM: arm64: Add usage of stage 2 fault lookup level in user_mem_abort()
    KVM: arm64: Fix handling of merging tables into a block entry
    KVM: arm64: Fix memory leak on stage2 update of a valid PTE

    Linus Torvalds
     

12 Dec, 2020

1 commit

  • Commit cae7ed3c2cb0 ("KVM: x86: Refactor the MMIO SPTE generation handling")
    cleaned up the computation of MMIO generation SPTE masks; however, it
    introduced a bug in how the upper part was encoded:
    SPTE bits 52-61 were supposed to contain bits 10-19 of the current
    generation number, but a missing shift encoded bits 1-10 there instead
    (mostly duplicating the lower part of the encoded generation number,
    which then consisted of bits 1-9).

    In the meantime, the upper part was shrunk by one bit and moved by
    subsequent commits to become an upper half of the encoded generation number
    (bits 9-17 of bits 0-17 encoded in a SPTE).

    In addition to the above, commit 56871d444bc4 ("KVM: x86: fix overlap between SPTE_MMIO_MASK and generation")
    has changed the SPTE bit range assigned to encode the generation number and
    the total number of bits encoded but did not update them in the comment
    attached to their defines, nor in the KVM MMU doc.
    Let's do it here, too, since it is too trivial a change to warrant a
    separate commit.
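
    A standalone model of the fixed encoding (the bit positions are
    illustrative, not the kernel's exact defines): everything derives
    from the two bit ranges, and the high half is shifted right by the
    width of the low half before being placed in the SPTE, which is
    precisely the shift the buggy version omitted.

        #include <stdint.h>

        #define GEN_LOW_START   3       /* SPTE bits 3..11  <- gen bits 0..8  */
        #define GEN_LOW_END     11
        #define GEN_HIGH_START  52      /* SPTE bits 52..60 <- gen bits 9..17 */
        #define GEN_HIGH_END    60

        #define GEN_LOW_BITS    (GEN_LOW_END - GEN_LOW_START + 1)
        #define GEN_HIGH_BITS   (GEN_HIGH_END - GEN_HIGH_START + 1)
        #define GEN_LOW_MASK    (((1ull << GEN_LOW_BITS) - 1) << GEN_LOW_START)
        #define GEN_HIGH_MASK   (((1ull << GEN_HIGH_BITS) - 1) << GEN_HIGH_START)

        static uint64_t encode_gen(uint64_t gen)
        {
                uint64_t spte = (gen << GEN_LOW_START) & GEN_LOW_MASK;

                /* The bug: without the ">> GEN_LOW_BITS" below, the low
                 * bits of the generation were duplicated into the high
                 * SPTE field instead of bits 9..17. */
                spte |= ((gen >> GEN_LOW_BITS) << GEN_HIGH_START) & GEN_HIGH_MASK;
                return spte;
        }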

    Fixes: cae7ed3c2cb0 ("KVM: x86: Refactor the MMIO SPTE generation handling")
    Signed-off-by: Maciej S. Szmigiero
    Message-Id:
    Cc: stable@vger.kernel.org
    [Reorganize macros so that everything is computed from the bit ranges. - Paolo]
    Signed-off-by: Paolo Bonzini

    Maciej S. Szmigiero
     

04 Dec, 2020

2 commits

  • In the TDP MMU, use shadow_phys_bits to determine the maximum possible GFN
    mapped in the guest for zapping operations. boot_cpu_data.x86_phys_bits
    may be reduced in the case of HW features that steal HPA bits for other
    purposes. However, this doesn't necessarily reduce GPA space that can be
    accessed via TDP. So zap based on a maximum gfn calculated with MAXPHYADDR
    retrieved from CPUID. This is already stored in shadow_phys_bits, so use
    it instead of x86_phys_bits.
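
    A sketch of the resulting calculation (the names follow the
    description; treat the exact expression as an assumption):

        /* Highest GFN that could possibly be mapped, based on the
         * MAXPHYADDR reported by CPUID rather than the possibly-reduced
         * boot_cpu_data.x86_phys_bits. */
        gfn_t max_gfn = 1ULL << (shadow_phys_bits - PAGE_SHIFT);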

    Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
    Signed-off-by: Rick Edgecombe
    Message-Id:
    Reviewed-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini

    Rick Edgecombe
     
  • The cpu arg for svm_cpu_uninit() was previously ignored, resulting in
    the per-cpu structure svm_cpu_data not being de-allocated for all cpus.
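
    A sketch of the fix (field names are assumptions): free the data for
    the cpu that was passed in rather than unconditionally using the
    current cpu.

        static void svm_cpu_uninit(int cpu)
        {
                struct svm_cpu_data *sd = per_cpu(svm_cpu_data, cpu);

                if (!sd)
                        return;

                per_cpu(svm_cpu_data, cpu) = NULL;
                __free_page(sd->save_area);
                kfree(sd);
        }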

    Signed-off-by: Jacob Xu
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Jacob Xu
     

28 Nov, 2020

2 commits

  • Pull kvm fixes from Paolo Bonzini:
    "ARM:
    - Fix alignment of the new HYP sections
    - Fix GICR_TYPER access from userspace

    S390:
    - do not reset the global diag318 data for per-cpu reset
    - do not mark memory as protected too early
    - fix for destroy page ultravisor call

    x86:
    - fix for SEV debugging
    - fix incorrect return code
    - fix for 'noapic' with PIC in userspace and LAPIC in kernel
    - fix for 5-level paging"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    kvm: x86/mmu: Fix get_mmio_spte() on CPUs supporting 5-level PT
    KVM: x86: Fix split-irqchip vs interrupt injection window request
    KVM: x86: handle !lapic_in_kernel case in kvm_cpu_*_extint
    MAINTAINERS: Update email address for Sean Christopherson
    MAINTAINERS: add uv.c also to KVM/s390
    s390/uv: handle destroy page legacy interface
    KVM: arm64: vgic-v3: Drop the reporting of GICR_TYPER.Last for userspace
    KVM: SVM: fix error return code in svm_create_vcpu()
    KVM: SVM: Fix offset computation bug in __sev_dbg_decrypt().
    KVM: arm64: Correctly align nVHE percpu data
    KVM: s390: remove diag318 reset code
    KVM: s390: pv: Mark mm as protected after the set secure parameters and improve cleanup

    Linus Torvalds
     
  • Commit 95fb5b0258b7 ("kvm: x86/mmu: Support MMIO in the TDP MMU") caused
    the following WARNING on an Intel Ice Lake CPU:

    get_mmio_spte: detect reserved bits on spte, addr 0xb80a0, dump hierarchy:
    ------ spte 0xb80a0 level 5.
    ------ spte 0xfcd210107 level 4.
    ------ spte 0x1004c40107 level 3.
    ------ spte 0x1004c41107 level 2.
    ------ spte 0x1db00000000b83b6 level 1.
    WARNING: CPU: 109 PID: 10254 at arch/x86/kvm/mmu/mmu.c:3569 kvm_mmu_page_fault.cold.150+0x54/0x22f [kvm]
    ...
    Call Trace:
    ? kvm_io_bus_get_first_dev+0x55/0x110 [kvm]
    vcpu_enter_guest+0xaa1/0x16a0 [kvm]
    ? vmx_get_cs_db_l_bits+0x17/0x30 [kvm_intel]
    ? skip_emulated_instruction+0xaa/0x150 [kvm_intel]
    kvm_arch_vcpu_ioctl_run+0xca/0x520 [kvm]

    The guest that triggers this warning crashes. Note, this happens with
    the traditional MMU and EPT enabled, not with the newly introduced TDP
    MMU. It turns out there was a subtle change in the above-mentioned
    commit. Previously,
    walk_shadow_page_get_mmio_spte() was setting 'root' to 'iterator.level'
    which is returned by shadow_walk_init() and this equals to
    'vcpu->arch.mmu->shadow_root_level'. Now, get_mmio_spte() sets it to
    'int root = vcpu->arch.mmu->root_level'.

    The difference between 'root_level' and 'shadow_root_level' on CPUs
    supporting 5-level page tables is that in some cases we don't want to
    use 5-level; in particular, 'cpuid_maxphyaddr(vcpu) <= 48' selects a
    4-level walk. For get_mmio_spte() what matters is the level the
    hardware walk actually starts at, so 'shadow_root_level' is the value
    to use.
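
    The fix, in essence, is a one-line change (reconstructed from the
    description, not quoted from the diff):

        int root = vcpu->arch.mmu->shadow_root_level;  /* was: ->root_level */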
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     

27 Nov, 2020

2 commits

  • kvm_cpu_accept_dm_intr and kvm_vcpu_ready_for_interrupt_injection are
    a hodge-podge of conditions, hacked together to get something that
    more or less works. But what is actually needed is much simpler;
    in both cases the fundamental question is, do we have a place to stash
    an interrupt if userspace does KVM_INTERRUPT?

    In userspace irqchip mode, that is !vcpu->arch.interrupt.injected.
    Currently kvm_event_needs_reinjection(vcpu) covers it, but it is
    unnecessarily restrictive.

    In split irqchip mode it's a bit more complicated, we need to check
    kvm_apic_accept_pic_intr(vcpu) (the IRQ window exit is basically an INTACK
    cycle and thus requires ExtINTs not to be masked) as well as
    !pending_userspace_extint(vcpu). However, there is no need to
    check kvm_event_needs_reinjection(vcpu), since split irqchip keeps
    pending ExtINT state separate from event injection state, and checking
    kvm_cpu_has_interrupt(vcpu) is wrong too since ExtINT has higher
    priority than APIC interrupts. In fact the latter fixes a bug:
    when userspace requests an IRQ window vmexit, an interrupt in the
    local APIC can cause kvm_cpu_has_interrupt() to be true and thus
    kvm_vcpu_ready_for_interrupt_injection() to return false. When this
    happens, vcpu_run does not exit to userspace but the interrupt window
    vmexits keep occurring. The VM loops without any hope of making progress.

    Once we try to fix these with something like

        return kvm_arch_interrupt_allowed(vcpu) &&
    -           !kvm_cpu_has_interrupt(vcpu) &&
    -           !kvm_event_needs_reinjection(vcpu) &&
    -           kvm_cpu_accept_dm_intr(vcpu);
    +           (!lapic_in_kernel(vcpu)
    +            ? !vcpu->arch.interrupt.injected
    +            : (kvm_apic_accept_pic_intr(vcpu)
    +               && !pending_userspace_extint(v)));

    we realize two things. First, thanks to the previous patch the complex
    conditional can reuse !kvm_cpu_has_extint(vcpu). Second, the interrupt
    window request in vcpu_enter_guest()

        bool req_int_win =
                dm_request_for_irq_injection(vcpu) &&
                kvm_cpu_accept_dm_intr(vcpu);

    should be kept in sync with kvm_vcpu_ready_for_interrupt_injection():
    it is unnecessary to ask the processor for an interrupt window
    if we would not be able to return to userspace. Therefore,
    kvm_cpu_accept_dm_intr(vcpu) is basically !kvm_cpu_has_extint(vcpu)
    ANDed with the existing check for masked ExtINT. It all makes sense:

    - we can accept an interrupt from userspace if there is a place
    to stash it (and, for irqchip split, ExtINTs are not masked).
    Interrupts from userspace _can_ be accepted even if right now
    EFLAGS.IF=0.

    - in order to tell userspace we will inject its interrupt ("IRQ
    window open" i.e. kvm_vcpu_ready_for_interrupt_injection), both
    KVM and the vCPU need to be ready to accept the interrupt.

    ... and this is what the patch implements.

    Reported-by: David Woodhouse
    Analyzed-by: David Woodhouse
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Reviewed-by: Nikos Tsironis
    Reviewed-by: David Woodhouse
    Tested-by: David Woodhouse

    Paolo Bonzini
     
  • Centralize handling of interrupts from the userspace APIC
    in kvm_cpu_has_extint and kvm_cpu_get_extint, since
    userspace APIC interrupts are handled more or less the
    same as ExtINTs are with split irqchip. This removes
    duplicated code from kvm_cpu_has_injectable_intr and
    kvm_cpu_has_interrupt, and makes the code more similar
    between kvm_cpu_has_{extint,interrupt} on one side
    and kvm_cpu_get_{extint,interrupt} on the other.

    Cc: stable@vger.kernel.org
    Reviewed-by: Filippo Sironi
    Reviewed-by: David Woodhouse
    Tested-by: David Woodhouse
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

16 Nov, 2020

2 commits

  • Pull kvm fixes from Paolo Bonzini:
    "Fixes for ARM and x86, the latter especially for old processors
    without two-dimensional paging (EPT/NPT)"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    kvm: mmu: fix is_tdp_mmu_check when the TDP MMU is not in use
    KVM: SVM: Update cr3_lm_rsvd_bits for AMD SEV guests
    KVM: x86: Introduce cr3_lm_rsvd_bits in kvm_vcpu_arch
    KVM: x86: clflushopt should be treated as a no-op by emulation
    KVM: arm64: Handle SCXTNUM_ELx traps
    KVM: arm64: Unify trap handlers injecting an UNDEF
    KVM: arm64: Allow setting of ID_AA64PFR0_EL1.CSV2 from userspace

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "A set of fixes for perf:

    - A set of commits which reduce the stack usage of various perf
    event handling functions which allocated large data structs on
    stack causing stack overflows in the worst case

    - Use the proper mechanism for detecting soft interrupts in the
    recursion protection

    - Make the recursion protection simpler and more robust

    - Simplify the scheduling of event groups to make the code more
    robust and prepare for fixing the issues vs. scheduling of
    exclusive event groups

    - Prevent event multiplexing and rotation for exclusive event groups

    - Correct the perf event attribute exclusive semantics to take
    pinned events, e.g. the PMU watchdog, into account

    - Make the anythread filtering conditional for Intel's generic PMU
    counters as it is no longer guaranteed to be supported on newer
    CPUs. Check the corresponding CPUID leaf to make sure

    - Fixup a duplicate initialization in an array which was probably
    caused by the usual 'copy & paste - forgot to edit' mishap"

    * tag 'perf-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/x86/intel/uncore: Fix Add BW copypasta
    perf/x86/intel: Make anythread filter support conditional
    perf: Tweak perf_event_attr::exclusive semantics
    perf: Fix event multiplexing for exclusive groups
    perf: Simplify group_sched_in()
    perf: Simplify group_sched_out()
    perf/x86: Make dummy_iregs static
    perf/arch: Remove perf_sample_data::regs_user_copy
    perf: Optimize get_recursion_context()
    perf: Fix get_recursion_context()
    perf/x86: Reduce stack usage for x86_pmu::drain_pebs()
    perf: Reduce stack usage of perf_output_begin()

    Linus Torvalds
     

15 Nov, 2020

1 commit

  • In some cases where shadow paging is in use, the root page will
    be either mmu->pae_root or vcpu->arch.mmu->lm_root. Then it will
    not have an associated struct kvm_mmu_page, because it is allocated
    with alloc_page instead of kvm_mmu_alloc_page.

    Just return false quickly from is_tdp_mmu_root if the TDP MMU is
    not in use, which also includes the case where shadow paging is
    enabled.
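
    A sketch of the early-out (the flag name is an assumption):

        static bool is_tdp_mmu_root(struct kvm *kvm, hpa_t root_hpa)
        {
                if (!kvm->arch.tdp_mmu_enabled)
                        return false;   /* covers shadow paging, too */

                /* Only now is it safe to look up the root's
                 * struct kvm_mmu_page and test whether it is a TDP MMU
                 * root. */
                return true;
        }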

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

13 Nov, 2020

3 commits

  • For AMD SEV guests, update the cr3_lm_rsvd_bits to mask
    the memory encryption bit in reserved bits.

    Signed-off-by: Babu Moger
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Babu Moger
     
  • SEV guests fail to boot on a system that supports the PCID feature.

    While emulating the RSM instruction, KVM reads the guest CR3
    and calls kvm_set_cr3(). If the vCPU is in the long mode,
    kvm_set_cr3() does a sanity check for the CR3 value. In this case,
    it validates whether the value has any reserved bits set. The
    reserved bit range is 63:cpuid_maxphyaddr(). When AMD memory
    encryption is enabled, the memory encryption bit is set in the CR3
    value. The memory encryption bit may fall within the KVM reserved
    bit range, causing the KVM emulation failure.

    Introduce a new field cr3_lm_rsvd_bits in kvm_vcpu_arch which will
    cache the reserved bits in the CR3 value. This will be initialized
    to rsvd_bits(cpuid_maxphyaddr(vcpu), 63).

    If the architecture has any special bits (like the AMD SEV encryption
    bit) that need to be masked from the reserved bits, they should be
    cleared in the vendor-specific kvm_x86_ops.vcpu_after_set_cpuid
    handler.
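
    A sketch of the two pieces (the CPUID leaf decoding mirrors how SEV
    reports the C-bit position in 0x8000001F:EBX[5:0]; treat the details
    as assumptions):

        /* Common code: cache the architectural reserved bits. */
        vcpu->arch.cr3_lm_rsvd_bits = rsvd_bits(cpuid_maxphyaddr(vcpu), 63);

        /* Vendor hook: for SEV guests, the encryption bit is not
         * reserved in CR3, so clear it from the cached mask. */
        best = kvm_find_cpuid_entry(vcpu, 0x8000001F, 0);
        if (best)
                vcpu->arch.cr3_lm_rsvd_bits &= ~(1UL << (best->ebx & 0x3f));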

    Fixes: a780a3ea628268b2 ("KVM: X86: Fix reserved bits check for MOV to CR3")
    Signed-off-by: Babu Moger
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Babu Moger
     
  • The instruction emulator ignores clflush instructions, yet fails to
    support clflushopt. Treat both similarly.

    Fixes: 13e457e0eebf ("KVM: x86: Emulator does not decode clflush well")
    Signed-off-by: David Edmondson
    Message-Id:
    Reviewed-by: Joao Martins
    Signed-off-by: Paolo Bonzini

    David Edmondson
     

10 Nov, 2020

1 commit

  • Starting with Arch Perfmon v5, the anythread filter on generic counters may be
    deprecated. The current kernel was exporting the any filter without checking.
    On Icelake, it means you could do cpu/event=0x3c,any/ even though the filter
    does not exist. This patch corrects the problem by relying on the CPUID 0xa leaf
    function to determine if anythread is supported or not as described in the
    Intel SDM Vol3b 18.2.5.1 AnyThread Deprecation section.
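
    A user-space sketch of the check (the EDX bit position follows my
    reading of the deprecation section; treat it as an assumption):

        #include <cpuid.h>
        #include <stdbool.h>

        static bool anythread_supported(void)
        {
                unsigned int eax, ebx, ecx, edx;

                if (!__get_cpuid(0xa, &eax, &ebx, &ecx, &edx))
                        return false;

                /* CPUID.0xA:EDX[15] set => AnyThread is deprecated. */
                return !(edx & (1u << 15));
        }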

    Signed-off-by: Stephane Eranian
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20201028194247.3160610-1-eranian@google.com

    Stephane Eranian
     

08 Nov, 2020

6 commits

  • Windows2016 guest tries to enable LBR by setting the corresponding bits
    in MSR_IA32_DEBUGCTLMSR. KVM does not emulate MSR_IA32_DEBUGCTLMSR and
    spams the host kernel logs with error messages like:

    kvm [...]: vcpu1, guest rIP: 0xfffff800a8b687d3 kvm_set_msr_common: MSR_IA32_DEBUGCTLMSR 0x1, nop

    This patch fixes this by enabling error logging only with
    'report_ignored_msrs=1'.

    Signed-off-by: Pankaj Gupta
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Pankaj Gupta
     
  • Commit 5b9bb0ebbcdc ("kvm: x86: encapsulate wrmsr(MSR_KVM_SYSTEM_TIME)
    emulation in helper fn", 2020-10-21) subtly changed the behavior of guest
    writes to MSR_KVM_SYSTEM_TIME(_NEW). Restore the previous behavior; update
    the masterclock any time the guest uses a different msr than before.

    Fixes: 5b9bb0ebbcdc ("kvm: x86: encapsulate wrmsr(MSR_KVM_SYSTEM_TIME) emulation in helper fn", 2020-10-21)
    Signed-off-by: Oliver Upton
    Reviewed-by: Peter Shier
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Oliver Upton
     
  • Make the paravirtual cpuid enforcement mechanism idempotent to ioctl()
    ordering by updating pv_cpuid.features whenever userspace requests the
    capability. Extract this update out of kvm_update_cpuid_runtime() into a
    new helper function and move its other call site into
    kvm_vcpu_after_set_cpuid() where it more likely belongs.

    Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID")
    Signed-off-by: Oliver Upton
    Reviewed-by: Peter Shier
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Oliver Upton
     
  • commit 66570e966dd9 ("kvm: x86: only provide PV features if enabled in
    guest's CPUID") only protects against disallowed guest writes to KVM
    paravirtual msrs, leaving msr reads unchecked. Fix this by enforcing
    KVM_CPUID_FEATURES for msr reads as well.

    Fixes: 66570e966dd9 ("kvm: x86: only provide PV features if enabled in guest's CPUID")
    Signed-off-by: Oliver Upton
    Reviewed-by: Peter Shier
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Oliver Upton
     
  • The recent introduction of userspace msr filtering added code that
    uses negative error codes for cases that are either resolved by
    delivering a #GP to the guest or handled by the userspace msr filter.

    This breaks the assumption that a negative error code returned from
    the msr emulation code is a semi-fatal error to be reported to
    userspace via the KVM_RUN ioctl, usually killing the guest.

    Fix this by reusing the already existing KVM_MSR_RET_INVALID error code,
    and by adding a new KVM_MSR_RET_FILTERED error code for the
    userspace filtered msrs.
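
    A sketch of the resulting convention (the numeric values are
    assumptions): positive codes describe recoverable conditions that
    common code turns into a #GP or a userspace exit, while negative
    codes still abort KVM_RUN.

        #define KVM_MSR_RET_INVALID   2  /* unknown MSR -> inject #GP      */
        #define KVM_MSR_RET_FILTERED  3  /* filtered MSR -> exit to
                                            userspace instead of a #GP     */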

    Fixes: 291f35fb2c1d1 ("KVM: x86: report negative values from wrmsr emulation to userspace")
    Reported-by: Qian Cai
    Signed-off-by: Maxim Levitsky
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Maxim Levitsky
     
  • Fix an off-by-one style bug in pte_list_add() where it failed to
    account the last full set of SPTEs, i.e. when desc->sptes is full
    and desc->more is NULL.

    Merge the two "PTE_LIST_EXT-1" checks as part of the fix to avoid
    an extra comparison.
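
    A sketch of the fixed walk (structure and helper names are taken
    from the description, so treat them as assumptions): a full
    descriptor is detected by its last slot being occupied, and each
    full descriptor bumps the count exactly once, including the final
    one whose ->more is NULL.

        while (desc->sptes[PTE_LIST_EXT - 1]) {
                count += PTE_LIST_EXT;
                if (!desc->more) {
                        desc->more = mmu_alloc_pte_list_desc(vcpu);
                        desc = desc->more;
                        break;
                }
                desc = desc->more;
        }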

    Signed-off-by: Li RongQing
    Reviewed-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Li RongQing
     

31 Oct, 2020

4 commits

  • Reported-by: kernel test robot
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • It was noticed that evmcs_sanitize_exec_ctrls() is not being executed
    nowadays despite the code checking 'enable_evmcs' static key looking
    correct. Turns out, static key magic doesn't work in '__init' section
    (and it is unclear when things changed) but setup_vmcs_config() is called
    only once per CPU so we don't really need it to. Switch to checking
    'enlightened_vmcs' instead, it is supposed to be in sync with
    'enable_evmcs'.

    Opportunistically make evmcs_sanitize_exec_ctrls '__init' and drop unneeded
    extra newline from it.

    Reported-by: Yang Weijiang
    Signed-off-by: Vitaly Kuznetsov
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     
  • The newly introduced kvm_msr_ignored_check() tries to print error or
    debug messages via vcpu_*() macros, but those may cause Oops when NULL
    vcpu is passed for KVM_GET_MSRS ioctl.

    Fix it by replacing the print calls with kvm_*() macros.

    (Note that this will leave vcpu argument completely unused in the
    function, but I didn't touch it to make the fix as small as
    possible. A clean up may be applied later.)

    Fixes: 12bc2132b15e ("KVM: X86: Do the same ignore_msrs check for feature msrs")
    BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1178280
    Cc:
    Signed-off-by: Takashi Iwai
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Takashi Iwai
     
  • Even though the compiler is able to replace static const variables with
    their value, it will warn about them being unused when Linux is built with W=1.
    Use good old macros instead, this is not C++.
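
    For illustration (a hypothetical name, not one from the patch):

        /* before: may trigger -Wunused-const-variable under W=1 */
        static const u32 VMX_EXAMPLE_MASK = 0xff;
        /* after */
        #define VMX_EXAMPLE_MASK 0xff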

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     

24 Oct, 2020

4 commits

  • During shutdown the IOAPIC trigger mode is reset to edge triggered
    while the vfio-pci INTx is still registered with a resampler.
    This allows us to get into an infinite loop:

    ioapic_set_irq
    -> ioapic_lazy_update_eoi
    -> kvm_ioapic_update_eoi_one
    -> kvm_notify_acked_irq
    -> kvm_notify_acked_gsi
    -> (via irq_acked fn ptr) irqfd_resampler_ack
    -> kvm_set_irq
    -> (via set fn ptr) kvm_set_ioapic_irq
    -> kvm_ioapic_set_irq
    -> ioapic_set_irq

    Commit 8be8f932e3db ("kvm: ioapic: Restrict lazy EOI update to
    edge-triggered interrupts", 2020-05-04) acknowledges that this recursion
    loop exists and tries to avoid it at the call to ioapic_lazy_update_eoi,
    but at this point the scenario is already set, we have an edge interrupt
    with resampler on the same gsi.

    Fortunately, the only user of irq ack notifiers (in addition to resamplefd)
    is i8254 timer interrupt reinjection. These are edge-triggered, so in
    principle they would need the call to kvm_ioapic_update_eoi_one from
    ioapic_lazy_update_eoi, but they already disable AVIC(*), so they don't
    need the lazy EOI behavior. Therefore, remove the call to
    kvm_ioapic_update_eoi_one from ioapic_lazy_update_eoi.

    This fixes CVE-2020-27152. Note that this issue cannot happen with
    SR-IOV assigned devices because virtual functions do not have INTx,
    only MSI.

    Fixes: f458d039db7e ("kvm: ioapic: Lazy update IOAPIC EOI")
    Suggested-by: Paolo Bonzini
    Tested-by: Alex Williamson
    Signed-off-by: Vitaly Kuznetsov
    Signed-off-by: Paolo Bonzini

    Vitaly Kuznetsov
     
  • allyesconfig results in:

    ld: drivers/block/paride/paride.o: in function `pi_init':
    (.text+0x1340): multiple definition of `pi_init'; arch/x86/kvm/vmx/posted_intr.o:posted_intr.c:(.init.text+0x0): first defined here
    make: *** [Makefile:1164: vmlinux] Error 1

    because commit:

    commit 8888cdd0996c2d51cd417f9a60a282c034f3fa28
    Author: Xiaoyao Li
    Date: Wed Sep 23 11:31:11 2020 -0700

    KVM: VMX: Extract posted interrupt support to separate files

    added another pi_init(), though one already existed in the paride code.

    Reported-by: Jens Axboe
    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Replace a modulo operator with the more common pattern for computing the
    gfn "offset" of a huge page to fix an i386 build error.

    arch/x86/kvm/mmu/tdp_mmu.c:212: undefined reference to `__umoddi3'

    In fact, almost all of tdp_mmu.c can be elided on 32-bit builds, but
    that is a much larger patch.
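
    The pattern in question, as a sketch (KVM_PAGES_PER_HPAGE is a power
    of two at every level, which is what makes the mask form legal):

        /* before: a 64-bit '%' pulls in __umoddi3 on i386 */
        offset = gfn % KVM_PAGES_PER_HPAGE(level);
        /* after: same result via masking */
        offset = gfn & (KVM_PAGES_PER_HPAGE(level) - 1);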

    Fixes: 2f2fad0897cb ("kvm: x86/mmu: Add functions to handle changed TDP SPTEs")
    Reported-by: Daniel Díaz
    Signed-off-by: Sean Christopherson
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Sean Christopherson
     
  • Pull KVM updates from Paolo Bonzini:
    "For x86, there is a new alternative and (in the future) more scalable
    implementation of extended page tables that does not need a reverse
    map from guest physical addresses to host physical addresses.

    For now it is disabled by default because it is still lacking a few of
    the existing MMU's bells and whistles. However it is a very solid
    piece of work and it is already available for people to hammer on it.

    Other updates:

    ARM:
    - New page table code for both hypervisor and guest stage-2
    - Introduction of a new EL2-private host context
    - Allow EL2 to have its own private per-CPU variables
    - Support of PMU event filtering
    - Complete rework of the Spectre mitigation

    PPC:
    - Fix for running nested guests with in-kernel IRQ chip
    - Fix race condition causing occasional host hard lockup
    - Minor cleanups and bugfixes

    x86:
    - allow trapping unknown MSRs to userspace
    - allow userspace to force #GP on specific MSRs
    - INVPCID support on AMD
    - nested AMD cleanup, on demand allocation of nested SVM state
    - hide PV MSRs and hypercalls for features not enabled in CPUID
    - new test for MSR_IA32_TSC writes from host and guest
    - cleanups: MMU, CPUID, shared MSRs
    - LAPIC latency optimizations and bugfixes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (232 commits)
    kvm: x86/mmu: NX largepage recovery for TDP MMU
    kvm: x86/mmu: Don't clear write flooding count for direct roots
    kvm: x86/mmu: Support MMIO in the TDP MMU
    kvm: x86/mmu: Support write protection for nesting in tdp MMU
    kvm: x86/mmu: Support disabling dirty logging for the tdp MMU
    kvm: x86/mmu: Support dirty logging for the TDP MMU
    kvm: x86/mmu: Support changed pte notifier in tdp MMU
    kvm: x86/mmu: Add access tracking for tdp_mmu
    kvm: x86/mmu: Support invalidate range MMU notifier for TDP MMU
    kvm: x86/mmu: Allocate struct kvm_mmu_pages for all pages in TDP MMU
    kvm: x86/mmu: Add TDP MMU PF handler
    kvm: x86/mmu: Remove disallowed_hugepage_adjust shadow_walk_iterator arg
    kvm: x86/mmu: Support zapping SPTEs in the TDP MMU
    KVM: Cache as_id in kvm_memory_slot
    kvm: x86/mmu: Add functions to handle changed TDP SPTEs
    kvm: x86/mmu: Allocate and free TDP MMU roots
    kvm: x86/mmu: Init / Uninit the TDP MMU
    kvm: x86/mmu: Introduce tdp_iter
    KVM: mmu: extract spte.h and spte.c
    KVM: mmu: Separate updating a PTE from kvm_set_pte_rmapp
    ...

    Linus Torvalds
     

23 Oct, 2020

2 commits

  • When KVM maps a largepage backed region at a lower level in order to
    make it executable (i.e. NX large page shattering), it reduces the TLB
    performance of that region. In order to avoid making this degradation
    permanent, KVM must periodically reclaim shattered NX largepages by
    zapping them and allowing them to be rebuilt in the page fault handler.

    With this patch, the TDP MMU does not respect KVM's rate limiting on
    reclaim. It traverses the entire TDP structure every time. This will be
    addressed in a future patch.

    Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
    machine. This series introduced no new failures.

    This series can be viewed in Gerrit at:
    https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

    Signed-off-by: Ben Gardon
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     
  • Direct roots don't have a write flooding count because the guest can't
    affect that paging structure. Thus there's no need to clear the write
    flooding count on a fast CR3 switch for direct roots.

    Tested by running kvm-unit-tests and KVM selftests on an Intel Haswell
    machine. This series introduced no new failures.

    This series can be viewed in Gerrit at:
    https://linux-review.googlesource.com/c/virt/kvm/kvm/+/2538

    Signed-off-by: Ben Gardon
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Ben Gardon