03 Apr, 2019

1 commit

  • commit ddba91801aeb5c160b660caed1800eb3aef403f8 upstream.

    KVM's API requires that ioctls be issued from the same process
    that created the VM. In other words, userspace can play games with a
    VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the
    creator can do anything useful. Explicitly reject device ioctls that
    are issued by a process other than the VM's creator, and update KVM's
    API documentation to extend its requirements to device ioctls.

    Fixes: 852b6d57dc7f ("kvm: add device control API")
    Cc:
    Signed-off-by: Sean Christopherson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Sean Christopherson
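
The rule in this commit can be pictured with a toy model. This is only a sketch: `ToyVm` and `device_ioctl` are hypothetical names, the kernel compares address spaces rather than PIDs, and the `EIO` errno is an assumption, since the commit text does not name the error code.

```python
import errno


class ToyVm:
    """Stand-in for a VM; remembers which process created it."""

    def __init__(self, creator_pid):
        self.creator_pid = creator_pid


def device_ioctl(vm, caller_pid, handler):
    """Run handler() only when the caller is the VM's creator.

    Any other caller gets EIO (assumed errno, not stated in the commit).
    """
    if caller_pid != vm.creator_pid:
        raise OSError(errno.EIO, "device ioctl issued by a non-creator process")
    return handler()
```

A process that merely inherited the file descriptor (for example through fork() or SCM_RIGHTS) would take the error path in this model.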
     

20 Sep, 2018

1 commit

  • Add KVM_CAP_MSR_PLATFORM_INFO so that userspace can disable guest access
    to reads of MSR_PLATFORM_INFO.

    Disabling access to reads of this MSR gives userspace the control to "expose"
    this platform-dependent information to guests in a clear way. As it exists
    today, guests that read this MSR would get unpopulated information if userspace
    hadn't already set it (and prior to this patch series, only the CPUID faulting
    information could have been populated). This existing interface could be
    confusing if guests don't handle the potential for incorrect/incomplete
    information gracefully (e.g. zero reported for base frequency).

    Signed-off-by: Drew Schmitt
    Signed-off-by: Paolo Bonzini

    Drew Schmitt
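
To illustrate the "zero reported for base frequency" case, here is a minimal guest-side sketch. The MSR index, the ratio field in bits 15:8, and the 100 MHz bus clock are assumptions taken from common Intel documentation, not from this commit; `base_frequency_mhz` is a hypothetical helper.

```python
MSR_PLATFORM_INFO = 0xCE   # assumed MSR index, not stated in the commit
BUS_CLOCK_MHZ = 100        # assumed bus clock; it varies by CPU generation


def base_frequency_mhz(platform_info):
    """Return the base frequency in MHz, or None when the ratio field is zero."""
    ratio = (platform_info >> 8) & 0xFF   # bits 15:8: maximum non-turbo ratio
    if ratio == 0:
        return None                        # unpopulated MSR: do not trust it
    return ratio * BUS_CLOCK_MHZ
```

A guest that treats `None` as "unknown" avoids acting on an unpopulated value, which is the confusion the commit describes.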
     

12 Sep, 2018

1 commit

  • We currently do not notify all gmaps when using gmap_pmdp_xchg(), due
    to locking constraints. This makes ucontrol VMs, which is the only VM
    type that creates multiple gmaps, incompatible with huge pages. Also
    we would need to hold the guest_table_lock of all gmaps that have this
    vmaddr mapped to synchronize access to the pmd.

    ucontrol VMs are rather exotic and creating a new locking concept is
    no easy task. Hence we return EINVAL when trying to activate
    KVM_CAP_S390_HPAGE_1M and report it as being not available when
    checking for it.

    Fixes: a4499382 ("KVM: s390: Add huge page enablement control")
    Signed-off-by: Janosch Frank
    Reviewed-by: David Hildenbrand
    Reviewed-by: Claudio Imbrenda
    Message-Id:
    Signed-off-by: Janosch Frank

    Janosch Frank
     

22 Aug, 2018

2 commits


06 Aug, 2018

2 commits

  • Use a hypercall to send IPIs with a single vmexit, instead of one vmexit
    per IPI in xAPIC/x2APIC physical mode and one vmexit per cluster in x2APIC
    cluster mode. An Intel guest can enter x2APIC cluster mode when interrupt
    remapping is enabled in QEMU; however, the latest AMD EPYC still supports
    only xAPIC mode, which benefits greatly from exit-less IPIs. This patchset
    lets a guest send multicast IPIs, with at most 128 destinations per
    hypercall in 64-bit mode and 64 vCPUs per hypercall in 32-bit mode.

    Hardware: Xeon Skylake 2.5GHz, 2 sockets, 40 cores, 80 threads, the VM
    is 80 vCPUs, IPI microbenchmark(https://lkml.org/lkml/2017/12/19/141):

    x2apic cluster mode, vanilla

    Dry-run: 0, 2392199 ns
    Self-IPI: 6907514, 15027589 ns
    Normal IPI: 223910476, 251301666 ns
    Broadcast IPI: 0, 9282161150 ns
    Broadcast lock: 0, 8812934104 ns

    x2apic cluster mode, pv-ipi

    Dry-run: 0, 2449341 ns
    Self-IPI: 6720360, 15028732 ns
    Normal IPI: 228643307, 255708477 ns
    Broadcast IPI: 0, 7572293590 ns => 22% performance boost
    Broadcast lock: 0, 8316124651 ns

    x2apic physical mode, vanilla

    Dry-run: 0, 3135933 ns
    Self-IPI: 8572670, 17901757 ns
    Normal IPI: 226444334, 255421709 ns
    Broadcast IPI: 0, 19845070887 ns
    Broadcast lock: 0, 19827383656 ns

    x2apic physical mode, pv-ipi

    Dry-run: 0, 2446381 ns
    Self-IPI: 6788217, 15021056 ns
    Normal IPI: 219454441, 249583458 ns
    Broadcast IPI: 0, 7806540019 ns => 154% performance boost
    Broadcast lock: 0, 9143618799 ns

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Vitaly Kuznetsov
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
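
The batching described above (at most 128 destinations per hypercall in 64-bit mode, 64 in 32-bit mode) can be sketched as follows. This is an illustrative model, not the kernel's implementation: `pack_ipi_batches` is a hypothetical helper, and the (lowest APIC ID, bitmap) argument shape is an assumption about how one hypercall describes its destinations.

```python
def pack_ipi_batches(apic_ids, window=128):
    """Group destination APIC IDs into (min_id, bitmap) pairs.

    Bit n of bitmap selects APIC ID min_id + n. Each pair covers at most
    `window` consecutive IDs and stands for one hypercall.
    """
    batches = []
    min_id, bitmap = None, 0
    for apic_id in sorted(set(apic_ids)):
        if min_id is None or apic_id - min_id >= window:
            if min_id is not None:
                batches.append((min_id, bitmap))   # flush the full window
            min_id, bitmap = apic_id, 0
        bitmap |= 1 << (apic_id - min_id)
    if min_id is not None:
        batches.append((min_id, bitmap))
    return batches
```

With the 80-vCPU guest from the benchmark, a broadcast fits in one batch when `window=128` and needs two when `window=64`.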
     
  • For nested virtualization, L0 KVM manages a bit of state for L2 guests;
    this state cannot be captured through the currently available IOCTLs. In
    fact the state captured through all of these IOCTLs is usually a mix of L1
    and L2 state. It is also dependent on whether the L2 guest was running at
    the moment when the process was interrupted to save its state.

    With this capability, there are two new vcpu ioctls: KVM_GET_NESTED_STATE
    and KVM_SET_NESTED_STATE. These can be used for saving and restoring a VM
    that is in VMX operation.

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: x86@kernel.org
    Cc: kvm@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Jim Mattson
    [karahmed@ - rename structs and functions and make them ready for AMD and
    address previous comments.
    - handle nested.smm state.
    - rebase & a bit of refactoring.
    - Merge 7/8 and 8/8 into one patch. ]
    Signed-off-by: KarimAllah Ahmed
    Signed-off-by: Paolo Bonzini

    Jim Mattson
     

31 Jul, 2018

1 commit

  • General KVM huge page support on s390 has to be enabled via the
    kvm.hpage module parameter. Either nested or hpage can be enabled, as
    we currently do not support vSIE for huge backed guests. Once the vSIE
    support is added we will either drop the parameter or enable it as
    default.

    For a guest the feature has to be enabled through the new
    KVM_CAP_S390_HPAGE_1M capability and the hpage module
    parameter. Enabling it means that cmm can't be enabled for the vm and
    disables pfmf and storage key interpretation.

    This is due to the fact that in some cases, in upcoming patches, we
    have to split huge pages in the guest mapping to be able to set more
    granular memory protection on 4k pages. These split pages have fake
    page tables that are not visible to the Linux memory management which
    subsequently will not manage its PGSTEs, while the SIE will. Disabling
    these features lets us manage PGSTE data in a consistent manner and
    solve that problem.

    Signed-off-by: Janosch Frank
    Reviewed-by: David Hildenbrand

    Janosch Frank
     

21 Jul, 2018

4 commits

  • arm64's new use of KVM's get_events/set_events API calls isn't just
    for RAS; it allows an SError that has been made pending by KVM as
    part of its device emulation to be migrated.

    Wire this up for 32bit too.

    We only need to read/write the HCR_VA bit, and check that no esr has
    been provided, as we don't yet support VDFSR.

    Signed-off-by: James Morse
    Reviewed-by: Dongjiu Geng
    Signed-off-by: Marc Zyngier

    James Morse
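
A rough model of the 32-bit behaviour described above: only the HCR_VA bit is read or written, and a request carrying an ESR is refused. The bit position of `HCR_VA`, the dictionary field names, and the use of `ValueError` are assumptions made for this sketch.

```python
HCR_VA = 1 << 8   # assumed position of the virtual abort bit in the 32-bit HCR


def get_events(hcr):
    """Report whether an SError is pending, as read from the HCR value."""
    return {"serror_pending": bool(hcr & HCR_VA), "serror_has_esr": False}


def set_events(hcr, events):
    """Return the new HCR value, rejecting any request that carries an ESR."""
    if events.get("serror_has_esr"):
        raise ValueError("an ESR cannot be honoured without VDFSR support")
    if events.get("serror_pending"):
        hcr |= HCR_VA
    return hcr
```

Feeding the result of `get_events` on the source into `set_events` on the destination carries the pending SError across a migration in this model.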
     
  • For the arm64 RAS Extension, user space can inject a virtual SError
    with a specified ESR. User space therefore needs to know whether KVM
    supports injecting such an SError; this interface adds a query for
    that capability.

    KVM checks whether the system supports the RAS Extension. If it does,
    KVM returns true to user space; otherwise it returns false.

    Signed-off-by: Dongjiu Geng
    Reviewed-by: James Morse
    [expanded documentation wording]
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier

    Dongjiu Geng
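
A capability query of this kind is normally made with KVM_CHECK_EXTENSION. The sketch below assumes the x86/arm64 ioctl encoding for the request number and leaves the capability number as a parameter, to be taken from <linux/kvm.h>; `check_extension` is a hypothetical helper that returns None when /dev/kvm cannot be used.

```python
import fcntl
import os

# _IO(KVMIO, 0x03) with KVMIO = 0xAE; this encoding is assumed for x86 and arm64.
KVM_CHECK_EXTENSION = (0xAE << 8) | 0x03


def check_extension(cap, path="/dev/kvm"):
    """Return the KVM_CHECK_EXTENSION result for `cap`, or None if KVM is unusable."""
    try:
        fd = os.open(path, os.O_RDWR | os.O_CLOEXEC)
    except OSError:
        return None
    try:
        # With an integer argument, fcntl.ioctl returns the ioctl's return value.
        return fcntl.ioctl(fd, KVM_CHECK_EXTENSION, cap)
    except OSError:
        return None
    finally:
        os.close(fd)
```

A non-zero result means the capability is available; zero or None means user space should not try to inject an SError with an ESR.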
     
  • When migrating VMs, user space may need to know the exception
    state. For example, if KVM makes an SError pending on machine A,
    then after migration to machine B, KVM also needs to pend an SError.

    This new IOCTL exports user-invisible states related to SError.
    Together with appropriate user space changes, user space can get/set
    the SError exception state to support migration, snapshot and suspend.

    Signed-off-by: Dongjiu Geng
    Reviewed-by: James Morse
    [expanded documentation wording]
    Signed-off-by: James Morse
    Signed-off-by: Marc Zyngier

    Dongjiu Geng
     
  • Update the documentation to reflect the ordering requirements of
    restoring the GICD_IIDR register before any other registers and the
    effects this has on restoring the interrupt groups for an emulated GICv2
    instance.

    Also remove some outdated limitations in the documentation while we're
    at it.

    Reviewed-by: Andrew Jones
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     

22 Jun, 2018

1 commit


02 Jun, 2018

3 commits


26 May, 2018

3 commits


25 May, 2018

1 commit

  • We introduce a new KVM_VGIC_V3_ADDR_TYPE_REDIST_REGION attribute in
    KVM_DEV_ARM_VGIC_GRP_ADDR group. It allows userspace to provide the
    base address and size of a redistributor region.

    Compared to KVM_VGIC_V3_ADDR_TYPE_REDIST, this new attribute allows
    declaring several separate redistributor regions.

    So the whole redist space does not need to be contiguous anymore.

    Signed-off-by: Eric Auger
    Reviewed-by: Peter Maydell
    Acked-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Eric Auger
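
As an illustration of how one region might be described in a single 64-bit attribute value, the sketch below packs a count, a 64 KiB aligned base, reserved flags, and a region index. The bit layout (count in bits 63:52, base in 51:16, flags in 15:12, index in 11:0) is an assumption based on the kernel's vGICv3 device documentation, not on this commit message; both helper names are hypothetical.

```python
def encode_redist_region(base, count, index, flags=0):
    """Pack a redistributor region description into one 64-bit value."""
    if base & 0xFFFF or base >> 52:
        raise ValueError("base must be 64 KiB aligned and fit in 52 bits")
    if not 0 < count < (1 << 12):
        raise ValueError("count must be in 1..4095")
    if not 0 <= index < (1 << 12) or not 0 <= flags < (1 << 4):
        raise ValueError("index or flags out of range")
    return (count << 52) | base | (flags << 12) | index


def decode_redist_region(value):
    """Inverse of encode_redist_region: returns (base, count, index, flags)."""
    base = value & (((1 << 36) - 1) << 16)
    return base, value >> 52, value & 0xFFF, (value >> 12) & 0xF
```

Two calls with different `index` values describe two non-contiguous regions, which is what the single-region attribute could not express.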
     

18 May, 2018

1 commit

  • KVM_HINTS_DEDICATED seems to be somewhat confusing:

    Guest doesn't really care whether it's the only task running on a host
    CPU as long as it's not preempted.

    And there are more reasons for Guest to be preempted than host CPU
    sharing, for example, with memory overcommit it can get preempted on a
    memory access, post copy migration can cause preemption, etc.

    Let's call it KVM_HINTS_REALTIME which seems to better
    match what guests expect.

    Also, the flag must be set on all vCPUs - current guests assume this.
    Note so in the documentation.

    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Paolo Bonzini

    Michael S. Tsirkin
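
The "set on all vCPUs" requirement lends itself to a small consistency check in a VMM. This is a sketch under assumptions: the hint is taken to be bit 0 of the hints word that each vCPU sees in its KVM CPUID features leaf, and both function names are hypothetical.

```python
KVM_HINTS_REALTIME = 0   # assumed bit index within the hints word


def has_hint(hints_word, hint):
    """True when the given hint bit is set in a vCPU's hints word."""
    return bool(hints_word & (1 << hint))


def realtime_hint_consistent(per_vcpu_hints):
    """True when every vCPU agrees on KVM_HINTS_REALTIME (all set or all clear)."""
    seen = {has_hint(word, KVM_HINTS_REALTIME) for word in per_vcpu_hints}
    return len(seen) <= 1
```

A VMM could refuse to start a guest for which `realtime_hint_consistent` returns False.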
     

20 Apr, 2018

1 commit

  • Although we've implemented PSCI 0.1, 0.2 and 1.0, we expose either 0.1
    or 1.0 to a guest, defaulting to the latest version of the PSCI
    implementation that is compatible with the requested version. This is
    no different from doing a firmware upgrade on KVM.

    But in order to give a chance to hypothetical badly implemented guests
    that would have a fit by discovering something other than PSCI 0.2,
    let's provide a new API that allows userspace to pick one particular
    version of the API.

    This is implemented as a new class of "firmware" registers, where
    we expose the PSCI version. This allows the PSCI version to be
    saved/restored as part of a guest migration, and also set to
    any supported version if the guest requires it.

    Cc: stable@vger.kernel.org #4.16
    Reviewed-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Marc Zyngier
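
To make the "firmware register" value concrete, the following sketch encodes and validates PSCI versions. It assumes the usual PSCI encoding (major in the upper 16 bits, minor in the lower 16); the helper names and the `ValueError` are hypothetical, and the real register interface applies further per-vCPU rules not modeled here.

```python
def psci_version(major, minor):
    """Encode a PSCI version as major << 16 | minor."""
    return (major << 16) | (minor & 0xFFFF)


def decode_psci_version(value):
    """Split an encoded PSCI version into (major, minor)."""
    return value >> 16, value & 0xFFFF


# The three versions the commit says are implemented.
SUPPORTED = {psci_version(0, 1), psci_version(0, 2), psci_version(1, 0)}


def validate_requested_version(value):
    """Return value if userspace may select it, else raise ValueError."""
    if value not in SUPPORTED:
        raise ValueError("unsupported PSCI version %d.%d" % decode_psci_version(value))
    return value
```

During migration, the destination would write back the value read on the source, so the guest keeps seeing the same PSCI version.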
     

29 Mar, 2018

1 commit


17 Mar, 2018

3 commits

  • If host CPUs are dedicated to a VM, we can avoid VM exits on HLT.
    This patch adds the per-VM capability to disable them.

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Jan H. Schönherr
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
     
  • Allowing a guest to execute MWAIT without interception enables a guest
    to put a (physical) CPU into a power-saving state that may take longer
    to return from than the host desires.

    Don't give a guest that power over a host by default. (Especially,
    since nothing prevents a guest from using MWAIT even when it is not
    advertised via CPUID.)

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Jan H. Schönherr
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini

    Wanpeng Li
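
The opt-in nature of these exits can be shown as a mask that starts at zero. The flag values below are assumptions based on the kernel's uapi header (MWAIT as bit 0, HLT as bit 1); `disable_exits_mask` is a hypothetical helper that only builds the argument a VMM would pass when enabling the per-VM capability.

```python
KVM_X86_DISABLE_EXITS_MWAIT = 1 << 0   # assumed uapi value
KVM_X86_DISABLE_EXITS_HLT = 1 << 1     # assumed uapi value


def disable_exits_mask(hlt=False, mwait=False):
    """Build the exit-disable mask; the default of 0 keeps every exit intercepted."""
    mask = 0
    if hlt:
        mask |= KVM_X86_DISABLE_EXITS_HLT
    if mwait:
        mask |= KVM_X86_DISABLE_EXITS_MWAIT
    return mask
```

Keeping the default at zero mirrors the commit's intent: the host must explicitly choose to hand MWAIT or HLT to a guest, typically only when CPUs are dedicated to it.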
     
  • …t/kvms390/linux into HEAD

    KVM: s390: fixes and features

    - more kvm stat counters
    - virtio gpu plumbing. The 3 non-KVM/s390 patches have Acks from
    Bartlomiej Zolnierkiewicz, Heiko Carstens and Greg Kroah-Hartman
    but all belong together to make virtio-gpu work as a tty. So
    I carried them in the KVM/s390 tree.
    - document some KVM_CAPs
    - cpu-model only facilities
    - cleanups

    Paolo Bonzini
     

15 Mar, 2018

1 commit

  • commit 35b3fde6203b ("KVM: s390: wire up bpb feature") has no
    documentation for KVM_CAP_S390_BPB. While adding this let's also add
    other missing capabilities like KVM_CAP_S390_PSW, KVM_CAP_S390_GMAP and
    KVM_CAP_S390_COW.

    Reviewed-by: Cornelia Huck
    Reviewed-by: David Hildenbrand
    Reviewed-by: Janosch Frank
    Signed-off-by: Christian Borntraeger

    Christian Borntraeger
     

07 Mar, 2018

4 commits

  • This patch introduces kvm_para_has_hint() to query for hints about
    the configuration of the guests. The first hint, KVM_HINTS_DEDICATED,
    is set if the guest has dedicated physical CPUs for each vCPU (i.e.
    pinning and no over-commitment). This allows optimizing spinlocks
    and tells the guest to avoid PV TLB flush.

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Eduardo Habkost
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář

    Wanpeng Li
     
  • This commit implements an enhanced x86 version of S390
    KVM_CAP_SYNC_REGS functionality. KVM_CAP_SYNC_REGS "allow[s]
    userspace to access certain guest registers without having
    to call SET/GET_*REGS". This reduces ioctl overhead, which
    is particularly important when userspace is making synchronous
    guest state modifications (e.g. when emulating and/or intercepting
    instructions).

    Originally implemented upstream for the S390, the x86 differences
    follow:
    - userspace can select the register sets to be synchronized with kvm_run
    using bit-flags in the kvm_valid_registers and kvm_dirty_registers
    fields.
    - vcpu_events is available in addition to the regs and sregs register
    sets.

    Signed-off-by: Ken Hofsass
    Reviewed-by: David Hildenbrand
    [Removed wrapper around check for reserved kvm_valid_regs. - Radim]
    Signed-off-by: Radim Krčmář

    Ken Hofsass
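
The bit-flag selection of register sets can be sketched as below. The three flag values are assumptions based on the x86 uapi header, and `check_sync_flags` is a hypothetical validator showing how reserved bits in the valid and dirty fields might be rejected.

```python
KVM_SYNC_X86_REGS = 1 << 0     # assumed uapi values
KVM_SYNC_X86_SREGS = 1 << 1
KVM_SYNC_X86_EVENTS = 1 << 2
KVM_SYNC_X86_VALID_FIELDS = (
    KVM_SYNC_X86_REGS | KVM_SYNC_X86_SREGS | KVM_SYNC_X86_EVENTS
)


def check_sync_flags(valid_regs, dirty_regs):
    """Raise ValueError if either field has a bit outside the known register sets."""
    reserved = (valid_regs | dirty_regs) & ~KVM_SYNC_X86_VALID_FIELDS
    if reserved:
        raise ValueError("reserved sync_regs bits set: %#x" % reserved)
    return valid_regs, dirty_regs
```

For example, a VMM that wants general registers and vcpu_events copied out on every exit, but only writes back general registers, would pass `KVM_SYNC_X86_REGS | KVM_SYNC_X86_EVENTS` and `KVM_SYNC_X86_REGS`.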
     
  • Replace hardcoded padding size value for struct kvm_sync_regs
    with #define SYNC_REGS_SIZE_BYTES.

    Also update the value specified in api.txt from outdated hardcoded
    value to SYNC_REGS_SIZE_BYTES.

    Signed-off-by: Ken Hofsass
    Reviewed-by: David Hildenbrand
    Acked-by: Christian Borntraeger
    Signed-off-by: Radim Krčmář

    Ken Hofsass
     
  • In Hyper-V, the fast guest->host notification mechanism is the
    SIGNAL_EVENT hypercall, with a single parameter of the connection ID to
    signal.

    Currently this hypercall incurs a user exit and requires the userspace
    to decode the parameters and trigger the notification of the potentially
    different I/O context.

    To avoid the costly user exit, process this hypercall and signal the
    corresponding eventfd in KVM, similar to ioeventfd. The association
    between the connection id and the eventfd is established via the newly
    introduced KVM_HYPERV_EVENTFD ioctl, and maintained in an
    (srcu-protected) IDR.

    Signed-off-by: Roman Kagan
    Reviewed-by: David Hildenbrand
    [asm/hyperv.h changes approved by KY Srinivasan. - Radim]
    Signed-off-by: Radim Krčmář

    Roman Kagan
     

02 Mar, 2018

1 commit

  • Provide a new KVM capability that allows bits within MSRs to be recognized
    as features. Two new ioctls are added to the /dev/kvm ioctl routine to
    retrieve the list of these MSRs and then retrieve their values. A kvm_x86_ops
    callback is used to determine support for the listed MSR-based features.

    Signed-off-by: Tom Lendacky
    Signed-off-by: Paolo Bonzini
    [Tweaked documentation. - Radim]
    Signed-off-by: Radim Krčmář

    Tom Lendacky
     

24 Feb, 2018

1 commit

  • Guests on new hypervisors might set the KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT
    bit when enabling async_PF, but this bit is reserved on old hypervisors,
    which results in a failure upon migration.

    To avoid breaking different cases, we are checking for CPUID feature bit
    before enabling the feature and nothing else.

    Fixes: 52a5c155cf79 ("KVM: async_pf: Let guest support delivery of async_pf from guest mode")
    Cc:
    Reviewed-by: Wanpeng Li
    Reviewed-by: David Hildenbrand
    Signed-off-by: Radim Krčmář
    Signed-off-by: Paolo Bonzini

    Radim Krčmář
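
The guest-side check described above can be sketched as follows. The feature bit number and MSR flag values are assumptions based on the kernel's KVM paravirt headers; `async_pf_msr_value` is a hypothetical helper that only computes the value a guest would write when enabling async_PF.

```python
KVM_FEATURE_ASYNC_PF_VMEXIT = 10               # assumed CPUID feature bit number
KVM_ASYNC_PF_ENABLED = 1 << 0                  # assumed MSR flag values
KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT = 1 << 2


def async_pf_msr_value(reason_area_pa, cpuid_features):
    """Compute the enable value, setting the VMEXIT delivery bit only if advertised."""
    value = reason_area_pa | KVM_ASYNC_PF_ENABLED
    if cpuid_features & (1 << KVM_FEATURE_ASYNC_PF_VMEXIT):
        value |= KVM_ASYNC_PF_DELIVERY_AS_PF_VMEXIT
    return value
```

On an old hypervisor the feature bit is clear, so the reserved bit is never written and the guest remains migratable.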
     

01 Feb, 2018

2 commits


31 Jan, 2018

1 commit


19 Jan, 2018

2 commits

  • This adds a new ioctl, KVM_PPC_GET_CPU_CHAR, that gives userspace
    information about the underlying machine's level of vulnerability
    to the recently announced vulnerabilities CVE-2017-5715,
    CVE-2017-5753 and CVE-2017-5754, and whether the machine provides
    instructions to assist software to work around the vulnerabilities.

    The ioctl returns two u64 words describing characteristics of the
    CPU and required software behaviour respectively, plus two mask
    words which indicate which bits have been filled in by the kernel,
    for extensibility. The bit definitions are the same as for the
    new H_GET_CPU_CHARACTERISTICS hypercall.

    There is also a new capability, KVM_CAP_PPC_GET_CPU_CHAR, which
    indicates whether the new ioctl is available.

    Signed-off-by: Paul Mackerras

    Paul Mackerras
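
The role of the two mask words can be illustrated with a three-state lookup: a bit is only meaningful when the kernel marked it as filled in. The tuple field names follow the description above, while `HYPOTHETICAL_FLUSH_BIT` and `lookup` are made-up names for this sketch and do not correspond to real bit definitions.

```python
from collections import namedtuple

# Two value words plus two mask words, as described in the commit message.
CpuChar = namedtuple(
    "CpuChar", "character behaviour character_mask behaviour_mask"
)

HYPOTHETICAL_FLUSH_BIT = 1 << 61   # illustrative only, not a real bit definition


def lookup(value, mask, bit):
    """Return True/False when the kernel filled in `bit`, or None when it did not."""
    if not mask & bit:
        return None
    return bool(value & bit)
```

An older kernel that does not know a newly defined bit leaves it out of the mask, so user space sees `None` rather than a misleading "not vulnerable".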
     
  • This merges in the ppc-kvm topic branch of the powerpc tree to get
    two patches which are prerequisites for the following patch series,
    plus another patch which touches both powerpc and KVM code.

    Signed-off-by: Paul Mackerras

    Paul Mackerras
     

16 Jan, 2018

2 commits

  • This part of Secure Encrypted Virtualization (SEV) patch series focuses on KVM
    changes required to create and manage SEV guests.

    SEV is an extension to the AMD-V architecture which supports running encrypted
    virtual machines (VMs) under the control of a hypervisor. Encrypted VMs have their
    pages (code and data) secured such that only the guest itself has access to the
    unencrypted version. Each encrypted VM is associated with a unique encryption key;
    if its data is accessed by a different entity using a different key, the encrypted
    guest's data will be incorrectly decrypted, leading to unintelligible data.
    This security model ensures that the hypervisor will no longer be able to inspect or
    alter any guest code or data.

    The key management of this feature is handled by a separate processor known as
    the AMD Secure Processor (AMD-SP) which is present on AMD SOCs. The SEV Key
    Management Specification (see below) provides a set of commands which can be
    used by hypervisor to load virtual machine keys through the AMD-SP driver.

    The patch series adds a new ioctl in KVM driver (KVM_MEMORY_ENCRYPT_OP). The
    ioctl will be used by qemu to issue SEV guest-specific commands defined in Key
    Management Specification.

    The following links provide additional details:

    AMD Memory Encryption white paper:
    http://amd-dev.wpengine.netdna-cdn.com/wordpress/media/2013/12/AMD_Memory_Encryption_Whitepaper_v7-Public.pdf

    AMD64 Architecture Programmer's Manual:
    http://support.amd.com/TechDocs/24593.pdf
    SME is section 7.10
    SEV is section 15.34

    SEV Key Management:
    http://support.amd.com/TechDocs/55766_SEV-KM API_Specification.pdf

    KVM Forum Presentation:
    http://www.linux-kvm.org/images/7/74/02x08A-Thomas_Lendacky-AMDs_Virtualizatoin_Memory_Encryption_Technology.pdf

    SEV Guest BIOS support:
    SEV support has been added to EDKII/OVMF BIOS
    https://github.com/tianocore/edk2

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • Remote TLB flush does a busy wait, which is fine in a bare-metal
    scenario. But within the guest, the vCPUs might have been preempted or
    blocked. In this scenario, the initiator vCPU would end up busy-waiting
    for a long time; it also consumes CPU unnecessarily to wake
    up the target of the shootdown.

    This patch set adds support for KVM's new paravirtualized TLB flush;
    remote TLB flush does not wait for vcpus that are sleeping, instead
    KVM will flush the TLB as soon as the vCPU starts running again.

    The improvement is clearly visible when the host is overcommitted; in this
    case, the PV TLB flush (in addition to avoiding the wait on the main CPU)
    prevents preempted vCPUs from stealing precious execution time from the
    running ones.

    Testing on a Xeon Gold 6142 2.6GHz 2 sockets, 32 cores, 64 threads,
    so 64 pCPUs, and each VM is 64 vCPUs.

    ebizzy -M
             vanilla    optimized    boost
    1VM       46799       48670         4%
    2VM       23962       42691        78%
    3VM       16152       37539       132%

    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Peter Zijlstra
    Signed-off-by: Wanpeng Li
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář

    Wanpeng Li
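
The deferred-flush idea can be modeled in a few lines. This is a simplified sketch: the two flag values are assumptions about the per-vCPU state shared between host and guest, the function names are hypothetical, and the real code has to update the flags atomically, which this single-threaded model ignores.

```python
KVM_VCPU_PREEMPTED = 1 << 0   # assumed flag: vCPU is not currently running
KVM_VCPU_FLUSH_TLB = 1 << 1   # assumed flag: flush requested for next entry


def flush_tlb_others(state, targets):
    """Guest side: mark preempted vCPUs for a deferred flush.

    `state` maps vCPU id to its flags and is updated in place. Returns the
    vCPUs that are running and therefore still need an IPI.
    """
    need_ipi = []
    for vcpu in targets:
        if state[vcpu] & KVM_VCPU_PREEMPTED:
            state[vcpu] |= KVM_VCPU_FLUSH_TLB   # no wait, no wake-up
        else:
            need_ipi.append(vcpu)
    return need_ipi


def vcpu_enter(state, vcpu):
    """Host side: when the vCPU runs again, report whether a flush is due."""
    flush_due = bool(state[vcpu] & KVM_VCPU_FLUSH_TLB)
    state[vcpu] = 0
    return flush_due
```

In an overcommitted host most targets tend to be preempted, so the initiator sends few IPIs and never spins waiting for a sleeping vCPU.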