26 Dec, 2011

1 commit

  • Unlike all of the other cpuid bits, the TSC deadline timer bit is set
    unconditionally, regardless of what userspace wants.

    This is broken in several ways:
    - if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
    deadline timer feature, a guest that uses the feature will break
    - live migration to older host kernels that don't support the TSC deadline
    timer will cause the feature to be pulled from under the guest's feet,
    breaking it
    - guests that are broken wrt the feature will fail.

    Fix by not enabling the feature automatically; instead report it to userspace.
    Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
    will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
    KVM_GET_SUPPORTED_CPUID.

    Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.

    [avi: add the KVM_CAP + documentation]

    Reported-by: Alexey Zaytsev
    Tested-by: Alexey Zaytsev
    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
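
A minimal sketch of the userspace decision this commit implies, assuming a VMM that builds the guest's CPUID leaf 1 itself. The helper name and its parameters are hypothetical; `cap_present` stands for the result of querying KVM_CAP_TSC_DEADLINE_TIMER through KVM_CHECK_EXTENSION. The only hardware fact used is that CPUID.01H:ECX bit 24 advertises the TSC deadline timer.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX bit 24 advertises the TSC deadline timer. */
#define CPUID_1_ECX_TSC_DEADLINE (UINT32_C(1) << 24)

/*
 * Hypothetical helper: compute the ECX value for guest CPUID leaf 1.
 *   cap_present  - result of the KVM_CAP_TSC_DEADLINE_TIMER query (> 0 if supported)
 *   have_irqchip - true if userspace called KVM_CREATE_IRQCHIP
 *   user_wants   - true if the guest configuration asks for the feature
 * The bit is never set automatically; all three conditions must hold.
 */
uint32_t guest_cpuid1_ecx(uint32_t ecx, int cap_present,
                          bool have_irqchip, bool user_wants)
{
    ecx &= ~CPUID_1_ECX_TSC_DEADLINE;
    if (cap_present > 0 && have_irqchip && user_wants)
        ecx |= CPUID_1_ECX_TSC_DEADLINE;
    return ecx;
}
```

Clearing the bit first means a value inherited from the host CPUID never leaks into the guest unless userspace opted in.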
     

17 Nov, 2011

1 commit


30 Oct, 2011

1 commit

  • Implement sigp external call, which might be required for guests that
    issue an external call instead of an emergency signal for IPI.

    This fixes an issue with "KVM: unknown SIGP: 0x02" when booting
    such an SMP guest.

    Signed-off-by: Christian Ehrhardt
    Signed-off-by: Christian Borntraeger
    Signed-off-by: Marcelo Tosatti

    Christian Ehrhardt
     

26 Sep, 2011

3 commits

  • Now that Book3S PV mode can also run PAPR guests, we can add a PAPR cap and
    enable it for all Book3S targets. Enabling that CAP switches KVM into PAPR
    mode.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • Until now, we always set HIOR based on the PVR, but this is just wrong.
    Instead, we should be setting HIOR explicitly, so user space can decide
    what the initial HIOR value is - just like on real hardware.

    We keep the old PVR based way around for backwards compatibility, but
    once user space uses the SREGS based method, we drop the PVR logic.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • The patch raises the hard limit of VCPU count to 254.

    This will allow developers to easily work on scalability
    and will allow users to test high VCPU setups easily without
    patching the kernel.

    To prevent possible issues with current setups, KVM_CAP_NR_VCPUS
    now returns the recommended VCPU limit (which is still 64) - this
    should be a safe value for everybody, while a new KVM_CAP_MAX_VCPUS
    returns the hard limit which is now 254.

    Cc: Avi Kivity
    Cc: Ingo Molnar
    Cc: Marcelo Tosatti
    Cc: Pekka Enberg
    Suggested-by: Pekka Enberg
    Signed-off-by: Sasha Levin
    Signed-off-by: Marcelo Tosatti

    Sasha Levin
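
A small sketch of how userspace might use the two limits, assuming it has already queried KVM_CAP_NR_VCPUS (recommended) and KVM_CAP_MAX_VCPUS (hard limit). The enum and function names are hypothetical. Falling back to the recommended value when the hard limit is reported as 0 is an assumption about kernels that predate the new capability.

```c
/* Hypothetical result of validating a requested VCPU count. */
enum vcpu_check {
    VCPU_OK,                 /* within the recommended limit */
    VCPU_ABOVE_RECOMMENDED,  /* allowed, but beyond the recommended limit */
    VCPU_REJECTED            /* beyond the hard limit, or not positive */
};

/*
 * recommended - value reported for KVM_CAP_NR_VCPUS (64 in this commit)
 * hard_max    - value reported for KVM_CAP_MAX_VCPUS (254 in this commit);
 *               assumed to be 0 on kernels without the capability
 */
enum vcpu_check check_vcpu_count(int requested, int recommended, int hard_max)
{
    if (hard_max <= 0)
        hard_max = recommended;   /* assumption: no separate hard limit known */
    if (requested < 1 || requested > hard_max)
        return VCPU_REJECTED;
    if (requested > recommended)
        return VCPU_ABOVE_RECOMMENDED;
    return VCPU_OK;
}
```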
     

20 Sep, 2011

1 commit

  • 598841ca9919d008b520114d8a4378c4ce4e40a1 ([S390] use gmap address
    spaces for kvm guest images) changed kvm on s390 to use a separate
    address space for kvm guests. We can now put KVM guests anywhere
    in the user address space with a size up to 8PB - as long as the
    memory is 1MB-aligned. This change was done without a KVM extension
    capability bit.
    The change was added after 3.0, but we still have a chance to add
    a feature bit before 3.1 (keeping the releases in a sane state).
    We use number 71 to avoid collisions with other pending kvm patches
    as requested by Alexander Graf.

    Signed-off-by: Christian Borntraeger
    Acked-by: Avi Kivity
    Cc: Alexander Graf
    Signed-off-by: Heiko Carstens

    Christian Borntraeger
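
The placement rule in this message (any user address, up to 8PB, 1MB-aligned) can be sketched as a simple check. The function and macro names are hypothetical, and requiring the size as well as the start address to be 1MB-aligned is an assumption; the message only says the memory must be 1MB-aligned.

```c
#include <stdbool.h>
#include <stdint.h>

#define S390_GUEST_ALIGN    (UINT64_C(1) << 20)   /* 1MB */
#define S390_GUEST_MAX_SIZE (UINT64_C(8) << 50)   /* 8PB */

/* Hypothetical check of a guest memory region against the stated limits. */
bool s390_guest_region_ok(uint64_t user_addr, uint64_t size)
{
    if (size == 0 || size > S390_GUEST_MAX_SIZE)
        return false;
    /* Assumption: both the start address and the size must be 1MB-aligned. */
    return (user_addr & (S390_GUEST_ALIGN - 1)) == 0 &&
           (size & (S390_GUEST_ALIGN - 1)) == 0;
}
```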
     

12 Jul, 2011

5 commits

  • This adds infrastructure which will be needed to allow book3s_hv KVM to
    run on older POWER processors, including PPC970, which don't support
    the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
    Offset (RMO) facility. These processors require a physically
    contiguous, aligned area of memory for each guest. When the guest does
    an access in real mode (MMU off), the address is compared against a
    limit value, and if it is lower, the address is ORed with an offset
    value (from the Real Mode Offset Register (RMOR)) and the result becomes
    the real address for the access. The size of the RMA has to be one of
    a set of supported values, which usually includes 64MB, 128MB, 256MB
    and some larger powers of 2.

    Since we are unlikely to be able to allocate 64MB or more of physically
    contiguous memory after the kernel has been running for a while, we
    allocate a pool of RMAs at boot time using the bootmem allocator. The
    size and number of the RMAs can be set using the kvm_rma_size=xx and
    kvm_rma_count=xx kernel command line options.

    KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
    of the pool of preallocated RMAs. The capability value is 1 if the
    processor can use an RMA but doesn't require one (because it supports
    the VRMA facility), or 2 if the processor requires an RMA for each guest.

    This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
    pool and returns a file descriptor which can be used to map the RMA. It
    also returns the size of the RMA in the argument structure.

    Having an RMA means we will get multiple KVM_SET_USER_MEMORY_REGION
    ioctl calls from userspace. To cope with this, we now preallocate the
    kvm->arch.ram_pginfo array when the VM is created with a size sufficient
    for up to 64GB of guest memory. Subsequently we will get rid of this
    array and use memory associated with each memslot instead.

    This moves most of the code that translates the user addresses into
    host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
    to kvmppc_core_prepare_memory_region. Also, instead of having to look
    up the VMA for each page in order to check the page size, we now check
    that the pages we get are compound pages of 16MB. However, if we are
    adding memory that is mapped to an RMA, we don't bother with calling
    get_user_pages_fast and instead just offset from the base pfn for the
    RMA.

    Typically the RMA gets added after vcpus are created, which makes it
    inconvenient to have the LPCR (logical partition control register) value
    in the vcpu->arch struct, since the LPCR controls whether the processor
    uses RMA or VRMA for the guest. This moves the LPCR value into the
    kvm->arch struct and arranges for the MER (mediated external request)
    bit, which is the only bit that varies between vcpus, to be set in
    assembly code when going into the guest if there is a pending external
    interrupt request.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
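
The real-mode access rule described in the first paragraph of this message (compare against a limit, then OR with the RMOR offset) can be sketched as follows, together with one way to interpret the KVM_CAP_PPC_RMA value. All names are hypothetical and the code models the described behaviour only, not the kernel implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interpretation of the KVM_CAP_PPC_RMA capability value. */
enum rma_need { RMA_UNAVAILABLE = 0, RMA_OPTIONAL = 1, RMA_REQUIRED = 2 };

enum rma_need rma_need_from_cap(int cap_value)
{
    if (cap_value == 1)
        return RMA_OPTIONAL;   /* processor also supports VRMA */
    if (cap_value == 2)
        return RMA_REQUIRED;   /* every guest needs an RMA */
    return RMA_UNAVAILABLE;
}

/*
 * Model of a real-mode (MMU off) guest access with the RMO facility:
 * an address below the limit is ORed with the offset from the RMOR.
 * Because the RMA is aligned to its size, OR acts like addition here.
 * Returns false if the address is not below the limit.
 */
bool rma_translate(uint64_t guest_addr, uint64_t rma_limit,
                   uint64_t rmor, uint64_t *real_addr)
{
    if (guest_addr >= rma_limit)
        return false;
    *real_addr = guest_addr | rmor;
    return true;
}
```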
     
  • This lifts the restriction that book3s_hv guests can only run one
    hardware thread per core, and allows them to use up to 4 threads
    per core on POWER7. The host still has to run single-threaded.

    This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
    capability. The return value of the ioctl querying this capability
    is the number of vcpus per virtual CPU core (vcore), currently 4.

    To use this, the host kernel should be booted with all threads
    active, and then all the secondary threads should be offlined.
    This will put the secondary threads into nap mode. KVM will then
    wake them from nap mode and use them for running guest code (while
    they are still offline). To wake the secondary threads, we send
    them an IPI using a new xics_wake_cpu() function, implemented in
    arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
    we assume that the platform has a XICS interrupt controller and
    we are using icp-native.c to drive it. Since the woken thread will
    need to acknowledge and clear the IPI, we also export the base
    physical address of the XICS registers using kvmppc_set_xics_phys()
    for use in the low-level KVM book3s code.

    When a vcpu is created, it is assigned to a virtual CPU core.
    The vcore number is obtained by dividing the vcpu number by the
    number of threads per core in the host. This number is exported
    to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
    to run the guest in single-threaded mode, it should make all vcpu
    numbers be multiples of the number of threads per core.

    We distinguish three states of a vcpu: runnable (i.e., ready to execute
    the guest), blocked (that is, idle), and busy in host. We currently
    implement a policy that the vcore can run only when all its threads
    are runnable or blocked. This way, if a vcpu needs to execute elsewhere
    in the kernel or in qemu, it can do so without being starved of CPU
    by the other vcpus.

    When a vcore starts to run, it executes in the context of one of the
    vcpu threads. The other vcpu threads all go to sleep and stay asleep
    until something happens requiring the vcpu thread to return to qemu,
    or to wake up to run the vcore (this can happen when another vcpu
    thread goes from busy in host state to blocked).

    It can happen that a vcpu goes from blocked to runnable state (e.g.
    because of an interrupt), and the vcore it belongs to is already
    running. In that case it can start to run immediately as long as
    none of the vcpus in the vcore have started to exit the guest.
    We send the next free thread in the vcore an IPI to get it to start
    to execute the guest. It synchronizes with the other threads via
    the vcore->entry_exit_count field to make sure that it doesn't go
    into the guest if the other vcpus are exiting by the time that it
    is ready to actually enter the guest.

    Note that there is no fixed relationship between the hardware thread
    number and the vcpu number. Hardware threads are assigned to vcpus
    as they become runnable, so we will always use the lower-numbered
    hardware threads in preference to higher-numbered threads if not all
    the vcpus in the vcore are runnable, regardless of which vcpus are
    runnable.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
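
A sketch of the vcpu-to-vcore mapping and the run policy described in this message, assuming `threads_per_core` is the positive value reported for KVM_CAP_PPC_SMT. The type and function names are hypothetical.

```c
#include <stdbool.h>

/* The three vcpu states named in the commit message. */
enum vcpu_state { VCPU_RUNNABLE, VCPU_BLOCKED, VCPU_BUSY_IN_HOST };

/* vcore number = vcpu number divided by threads per core (integer division). */
int vcore_of(int vcpu_id, int threads_per_core)
{
    return vcpu_id / threads_per_core;
}

/*
 * For a single-threaded guest, every vcpu number is a multiple of the
 * threads-per-core value, so each vcpu lands in its own vcore.
 */
int single_threaded_vcpu_id(int index, int threads_per_core)
{
    return index * threads_per_core;
}

/* Policy: the vcore may run only if every vcpu is runnable or blocked. */
bool vcore_can_run(const enum vcpu_state *states, int count)
{
    for (int i = 0; i < count; i++) {
        if (states[i] == VCPU_BUSY_IN_HOST)
            return false;
    }
    return true;
}
```

With 4 threads per core, vcpu numbers 0, 4, 8, ... each map to a distinct vcore, which is how userspace would request single-threaded operation.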
     
  • This improves I/O performance for guests using the PAPR
    paravirtualization interface by making the H_PUT_TCE hcall faster, by
    implementing it in real mode. H_PUT_TCE is used for updating virtual
    IOMMU tables, and is used both for virtual I/O and for real I/O in the
    PAPR interface.

    Since this moves the IOMMU tables into the kernel, we define a new
    KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables. The
    ioctl returns a file descriptor which can be used to mmap the newly
    created table. The qemu driver models use them in the same way as
    userspace managed tables, but they can be updated directly by the
    guest with a real-mode H_PUT_TCE implementation, reducing the number
    of host/guest context switches during guest IO.

    There are certain circumstances where it is useful for userland qemu
    to write to the TCE table even if the kernel H_PUT_TCE path is used
    most of the time. Specifically, allowing this will avoid awkwardness
    when we need to reset the table. More importantly, we will in the
    future need to write the table in order to restore its state after a
    checkpoint resume or migration.

    Signed-off-by: David Gibson
    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    David Gibson
     
  • This adds support for KVM running on 64-bit Book 3S processors,
    specifically POWER7, in hypervisor mode. Using hypervisor mode means
    that the guest can use the processor's supervisor mode. That means
    that the guest can execute privileged instructions and access privileged
    registers itself without trapping to the host. This gives excellent
    performance, but does mean that KVM cannot emulate a processor
    architecture other than the one that the hardware implements.

    This code assumes that the guest is running paravirtualized using the
    PAPR (Power Architecture Platform Requirements) interface, which is the
    interface that IBM's PowerVM hypervisor uses. That means that existing
    Linux distributions that run on IBM pSeries machines will also run
    under KVM without modification. In order to communicate the PAPR
    hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
    to include/linux/kvm.h.

    Currently the choice between book3s_hv support and book3s_pr support
    (i.e. the existing code, which runs the guest in user mode) has to be
    made at kernel configuration time, so a given kernel binary can only
    do one or the other.

    This new book3s_hv code doesn't support MMIO emulation at present.
    Since we are running paravirtualized guests, this isn't a serious
    restriction.

    With the guest running in supervisor mode, most exceptions go straight
    to the guest. We will never get data or instruction storage or segment
    interrupts, alignment interrupts, decrementer interrupts, program
    interrupts, single-step interrupts, etc., coming to the hypervisor from
    the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
    exception entry path so that we don't have to do the KVM test on entry
    to those exception handlers.

    We do however get hypervisor decrementer, hypervisor data storage,
    hypervisor instruction storage, and hypervisor emulation assist
    interrupts, so we have to handle those.

    In hypervisor mode, real-mode accesses can access all of RAM, not just
    a limited amount. Therefore we put all the guest state in the vcpu.arch
    and use the shadow_vcpu in the PACA only for temporary scratch space.
    We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
    anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
    We don't have a shared page with the guest, but we still need a
    kvm_vcpu_arch_shared struct to store the values of various registers,
    so we include one in the vcpu_arch struct.

    The POWER7 processor has a restriction that all threads in a core have
    to be in the same partition. MMU-on kernel code counts as a partition
    (partition 0), so we have to do a partition switch on every entry to and
    exit from the guest. At present we require the host and guest to run
    in single-thread mode because of this hardware restriction.

    This code allocates a hashed page table for the guest and initializes
    it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
    require that the guest memory is allocated using 16MB huge pages, in
    order to simplify the low-level memory management. This also means that
    we can get away without tracking paging activity in the host for now,
    since huge pages can't be paged or swapped.

    This also adds a few new exports needed by the book3s_hv code.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • Neither host_irq nor the guest_msi struct is used anymore.
    Tag the former, drop the latter to avoid confusion.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     

22 May, 2011

1 commit


11 May, 2011

1 commit


12 Jan, 2011

1 commit


24 Oct, 2010

2 commits

  • Now that we have all the level interrupt magic in place, let's
    expose the capability to user space, so it can make use of it!

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • We need to tell the guest the opcodes that make up a hypercall through
    interfaces that are controlled by userspace. So we need to add a call
    for userspace to allow it to query those opcodes so it can pass them
    on.

    This is required because the hypercall opcodes can change based on
    the hypervisor conditions. If we're running in hardware accelerated
    hypervisor mode, a hypercall looks different from when we're running
    without hardware acceleration.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     

01 Aug, 2010

2 commits


17 May, 2010

3 commits

  • MOL uses its own hypercall interface to call back into userspace when
    the guest wants to do something.

    So let's implement that as an exit reason, specify it with a CAP and
    only really use it when userspace wants us to.

    The only user of it so far is MOL.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • Sometimes we don't want all capabilities to be available to all
    our vcpus. One example of that is the OSI interface, implemented
    in the next patch.

    To provide a generic mechanism for enabling capabilities
    individually, this patch introduces a new ioctl that can be used
    for this purpose. That way, features we don't want in all guests or
    userspace configurations can simply be left disabled.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
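
The idea of enabling capabilities per vcpu can be sketched as a bitmask that starts empty and is only changed by an explicit request. Everything here is hypothetical (the capability numbering, the struct, and the function names); the commit message does not describe the ioctl's argument layout.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical capability numbers, for illustration only. */
enum { CAP_OSI = 0, CAP_COUNT };

/* Hypothetical per-vcpu state: one bit per optional capability. */
struct vcpu_caps {
    uint32_t enabled;
};

/*
 * Enable one capability on one vcpu. Unknown capabilities and non-zero
 * flags are rejected, so unused fields stay available for later extensions.
 * Returns 0 on success, -1 on an invalid request.
 */
int vcpu_enable_cap(struct vcpu_caps *vcpu, uint32_t cap, uint32_t flags)
{
    if (flags != 0 || cap >= CAP_COUNT)
        return -1;
    vcpu->enabled |= UINT32_C(1) << cap;
    return 0;
}

bool vcpu_has_cap(const struct vcpu_caps *vcpu, uint32_t cap)
{
    return cap < CAP_COUNT && (vcpu->enabled & (UINT32_C(1) << cap)) != 0;
}
```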
     
  • Userspace can tell us that it wants to trigger an interrupt. But
    so far it can't tell us that it wants to stop triggering one.

    So let's interpret the parameter to the ioctl that we have anyways
    to tell us if we want to raise or lower the interrupt line.

    Signed-off-by: Alexander Graf

    v2 -> v3:

    - Add CAP for unset irq
    Signed-off-by: Avi Kivity

    Alexander Graf
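
A sketch of interpreting the ioctl parameter as "raise" or "lower". The two encodings and all names are hypothetical placeholders rather than the values defined in the header. Treating every value other than the "unset" encoding as a raise is an assumption that keeps older userspace working.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical parameter encodings, for illustration only. */
#define IRQ_PARAM_SET   UINT32_C(0xffffffff)
#define IRQ_PARAM_UNSET UINT32_C(0xfffffffe)

/* Hypothetical model of the external interrupt line of one vcpu. */
struct irq_line {
    bool external_pending;
};

/*
 * Only the explicit "unset" value lowers the line; any other value
 * raises it, which matches the behaviour before the parameter was used.
 */
void handle_interrupt_param(struct irq_line *line, uint32_t irq)
{
    if (irq == IRQ_PARAM_UNSET)
        line->external_pending = false;
    else
        line->external_pending = true;
}
```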
     

25 Apr, 2010

3 commits


01 Mar, 2010

6 commits


09 Dec, 2009

1 commit


08 Dec, 2009

1 commit

  • Currently userspace has no chance to find out which virtual address space we're
    in and resolve addresses. While that is a big problem for migration, it's also
    unpleasant when debugging, as gdb and the monitor don't work on virtual
    addresses.

    This patch exports enough of the MMU segment state to userspace to make
    debugging work and thus also includes the groundwork for migration.

    Signed-off-by: Alexander Graf
    Signed-off-by: Benjamin Herrenschmidt

    Alexander Graf
     

03 Dec, 2009

7 commits

  • This patch moves s390 processor status word into the base kvm_run
    struct and keeps it up to date on all userspace exits.

    The userspace ABI is broken by this, however there are no applications
    in the wild using this. A capability check is provided so users can
    verify the updated API exists.

    Cc: stable@kernel.org
    Signed-off-by: Carsten Otte
    Signed-off-by: Avi Kivity

    Carsten Otte
     
  • This new IOCTL exports all yet user-invisible states related to
    exceptions, interrupts, and NMIs. Together with appropriate user space
    changes, this fixes sporadic problems of vmsave/restore, live migration
    and system reset.

    [avi: future-proof abi by adding a flags field]

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • These happen when we trap an exception when another exception is being
    delivered; we only expect these with MCEs and page faults. If something
    unexpected happens, things probably went south and we're better off reporting
    an internal error and freezing.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Usually userspace will freeze the guest so we can inspect it, but some
    internal state is not available. Add extra data to internal error
    reporting so we can expose it to the debugger. Extra data is specific
    to the suberror.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Obviously, people tend to extend this header at the bottom - more or
    less blindly. Ensure that deprecated stuff gets its own corner again by
    moving things to the top. Also add some comments and reindent IOCTLs to
    make them more readable and reduce the risk of number collisions.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • When we migrate a kvm guest that uses pvclock between two hosts, we may
    suffer a large skew. This is because there can be significant differences
    between the monotonic clock of the hosts involved. When a new host with
    a much larger monotonic time starts running the guest, the view of time
    will be significantly impacted.

    The situation is much worse when we do the opposite and migrate to a host
    with a smaller monotonic clock.

    This proposed ioctl will allow userspace to tell us the monotonic clock
    value of the source host, so we can keep the time skew small and, more
    importantly, make sure time never goes backwards. Userspace may also need
    to query the current value, since from the first migration onwards it
    won't be reflected by a simple call to clock_gettime() anymore.

    [marcelo: future-proof abi with a flags field]
    [jan: fix KVM_GET_CLOCK by clearing flags field instead of checking it]

    Signed-off-by: Glauber Costa
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Glauber Costa
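
One way to picture the mechanism is a per-guest offset from the host's monotonic clock, set from the source host's value at migration time. This is a sketch under that assumption, with hypothetical names. The clamp against the last returned value is an extra illustration of "never goes backwards", not something the message describes.

```c
#include <stdint.h>

/* Hypothetical per-guest clock state. */
struct guest_clock {
    uint64_t offset_ns;  /* guest time minus host monotonic time, modulo 2^64 */
    uint64_t last_ns;    /* last value handed out */
};

/* Destination side: adopt the clock value reported by the source host. */
void guest_clock_set(struct guest_clock *gc, uint64_t source_clock_ns,
                     uint64_t host_monotonic_ns)
{
    /* Unsigned wrap-around makes this correct for negative offsets too. */
    gc->offset_ns = source_clock_ns - host_monotonic_ns;
    gc->last_ns = source_clock_ns;
}

/* Read the guest clock; clamp so the result never decreases. */
uint64_t guest_clock_get(struct guest_clock *gc, uint64_t host_monotonic_ns)
{
    uint64_t now = host_monotonic_ns + gc->offset_ns;
    if (now < gc->last_ns)
        now = gc->last_ns;
    gc->last_ns = now;
    return now;
}
```

Because the guest value is no longer the host's own monotonic time after the first migration, userspace has to read it back through the guest clock rather than through clock_gettime().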
     
  • Support for Xen PV-on-HVM guests can be implemented almost entirely in
    userspace, except for handling one annoying MSR that maps a Xen
    hypercall blob into guest address space.

    A generic mechanism to delegate MSR writes to userspace seems overkill
    and risks encouraging similar MSR abuse in the future. Thus this patch
    adds special support for the Xen HVM MSR.

    I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
    KVM which MSR the guest will write to, as well as the starting address
    and size of the hypercall blobs (one each for 32-bit and 64-bit) that
    userspace has loaded from files. When the guest writes to the MSR, KVM
    copies one page of the blob from userspace to the guest.

    I've tested this patch with a hacked-up version of Gerd's userspace
    code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
    FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.

    [jan: fix i386 build warning]
    [avi: future proof abi with a flags field]

    Signed-off-by: Ed Swierk
    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Ed Swierk
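
A sketch of the MSR write path described in this message: pick the 32-bit or 64-bit blob and copy one page of it into guest memory. The struct layout and names are hypothetical. Splitting the written value into a destination address (upper bits) and a blob page index (low 12 bits) is an assumption; the message does not say how the page is selected.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE_BYTES 4096u

/* Hypothetical configuration handed over by userspace. */
struct xen_hvm_blobs {
    const uint8_t *blob32;   /* hypercall blob for 32-bit guests */
    uint8_t blob32_pages;
    const uint8_t *blob64;   /* hypercall blob for 64-bit guests */
    uint8_t blob64_pages;
};

/*
 * Handle a guest write of `data` to the configured MSR.
 * Assumption: the low 12 bits of `data` select the blob page and the
 * remaining bits give the page-aligned guest destination address.
 * Returns 0 on success, -1 if the request is out of range.
 */
int xen_hvm_msr_write(const struct xen_hvm_blobs *cfg, int long_mode,
                      uint64_t data, uint8_t *guest_mem, size_t guest_mem_size)
{
    const uint8_t *blob = long_mode ? cfg->blob64 : cfg->blob32;
    uint64_t pages = long_mode ? cfg->blob64_pages : cfg->blob32_pages;
    uint64_t page_num = data & (PAGE_SIZE_BYTES - 1);
    uint64_t dest = data & ~(uint64_t)(PAGE_SIZE_BYTES - 1);

    if (page_num >= pages)
        return -1;
    if (dest > guest_mem_size || guest_mem_size - dest < PAGE_SIZE_BYTES)
        return -1;
    /* Copy exactly one page of the blob into the guest. */
    memcpy(guest_mem + dest, blob + page_num * PAGE_SIZE_BYTES, PAGE_SIZE_BYTES);
    return 0;
}
```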