12 Jan, 2011

1 commit


24 Oct, 2010

2 commits

  • Now that we have all the level interrupt magic in place, let's
    expose the capability to user space, so it can make use of it!

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • We need to tell the guest the opcodes that make up a hypercall through
    interfaces that are controlled by userspace. So we need to add a call
    for userspace to allow it to query those opcodes so it can pass them
    on.

    This is required because the hypercall opcodes can change based on
    the hypervisor conditions. If we're running in hardware accelerated
    hypervisor mode, a hypercall looks different from when we're running
    without hardware acceleration.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     

01 Aug, 2010

2 commits


17 May, 2010

3 commits

  • MOL uses its own hypercall interface to call back into userspace when
    the guest wants to do something.

    So let's implement that as an exit reason, specify it with a CAP and
    only really use it when userspace wants us to.

    The only user of it so far is MOL.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • Sometimes we don't want all capabilities to be available to all
    our vcpus. One example of that is the OSI interface, implemented
    in the next patch.

    In order to have a generic mechanism for enabling capabilities
    individually, this patch introduces a new ioctl that can be used
    for this purpose. That way features we don't want in all guests or
    userspace configurations can just not be enabled and we're good.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • Userspace can tell us that it wants to trigger an interrupt. But
    so far it can't tell us that it wants to stop triggering one.

    So let's interpret the parameter to the ioctl that we have anyways
    to tell us if we want to raise or lower the interrupt line.

    Signed-off-by: Alexander Graf

    v2 -> v3:

    - Add CAP for unset irq

    Signed-off-by: Avi Kivity

    Alexander Graf
     

25 Apr, 2010

3 commits


01 Mar, 2010

6 commits


09 Dec, 2009

1 commit


08 Dec, 2009

1 commit

  • Currently userspace has no chance to find out which virtual address space we're
    in and resolve addresses. While that is a big problem for migration, it's also
    unpleasant when debugging, as gdb and the monitor don't work on virtual
    addresses.

    This patch exports enough of the MMU segment state to userspace to make
    debugging work and thus also includes the groundwork for migration.

    Signed-off-by: Alexander Graf
    Signed-off-by: Benjamin Herrenschmidt

    Alexander Graf
     

03 Dec, 2009

7 commits

  • This patch moves the s390 processor status word into the base kvm_run
    struct and keeps it up to date on all userspace exits.

    The userspace ABI is broken by this, however there are no applications
    in the wild using this. A capability check is provided so users can
    verify the updated API exists.

    Cc: stable@kernel.org
    Signed-off-by: Carsten Otte
    Signed-off-by: Avi Kivity

    Carsten Otte
     
  • This new IOCTL exports all yet user-invisible states related to
    exceptions, interrupts, and NMIs. Together with appropriate user space
    changes, this fixes sporadic problems of vmsave/restore, live migration
    and system reset.

    [avi: future-proof abi by adding a flags field]

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • These happen when we trap an exception when another exception is being
    delivered; we only expect these with MCEs and page faults. If something
    unexpected happens, things probably went south and we're better off reporting
    an internal error and freezing.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Usually userspace will freeze the guest so we can inspect it, but some
    internal state is not available. Add extra data to internal error
    reporting so we can expose it to the debugger. Extra data is specific
    to the suberror.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Obviously, people tend to extend this header at the bottom - more or
    less blindly. Ensure that deprecated stuff gets its own corner again by
    moving things to the top. Also add some comments and reindent IOCTLs to
    make them more readable and reduce the risk of number collisions.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • When we migrate a kvm guest that uses pvclock between two hosts, we may
    suffer a large skew. This is because there can be significant differences
    between the monotonic clock of the hosts involved. When a new host with
    a much larger monotonic time starts running the guest, the view of time
    will be significantly impacted.

    The situation is much worse when we do the opposite, and migrate to a host with
    a smaller monotonic clock.

    This proposed ioctl will allow userspace to inform us of the monotonic
    clock value in the source host, so we can keep the time skew short and,
    more importantly, ensure time never goes backwards. Userspace may also need to trigger
    the current data, since from the first migration onwards, it won't be
    reflected by a simple call to clock_gettime() anymore.

    [marcelo: future-proof abi with a flags field]
    [jan: fix KVM_GET_CLOCK by clearing flags field instead of checking it]

    Signed-off-by: Glauber Costa
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Glauber Costa
     
  • Support for Xen PV-on-HVM guests can be implemented almost entirely in
    userspace, except for handling one annoying MSR that maps a Xen
    hypercall blob into guest address space.

    A generic mechanism to delegate MSR writes to userspace seems overkill
    and risks encouraging similar MSR abuse in the future. Thus this patch
    adds special support for the Xen HVM MSR.

    I implemented a new ioctl, KVM_XEN_HVM_CONFIG, that lets userspace tell
    KVM which MSR the guest will write to, as well as the starting address
    and size of the hypercall blobs (one each for 32-bit and 64-bit) that
    userspace has loaded from files. When the guest writes to the MSR, KVM
    copies one page of the blob from userspace to the guest.

    I've tested this patch with a hacked-up version of Gerd's userspace
    code, booting a number of guests (CentOS 5.3 i386 and x86_64, and
    FreeBSD 8.0-RC1 amd64) and exercising PV network and block devices.

    [jan: fix i386 build warning]
    [avi: future proof abi with a flags field]

    Signed-off-by: Ed Swierk
    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Ed Swierk
     

10 Sep, 2009

11 commits

  • KVM now allows the guest to modify the guest physical address of EPT's identity mapping page.

    (change from v1, discard unnecessary check, change ioctl to accept parameter
    address rather than value)

    Signed-off-by: Sheng Yang
    Signed-off-by: Marcelo Tosatti

    Sheng Yang
     
  • ioeventfd is a mechanism to register PIO/MMIO regions to trigger an eventfd
    signal when written to by a guest. Host userspace can register any
    arbitrary IO address with a corresponding eventfd and then pass the eventfd
    to a specific end-point of interest for handling.

    Normal IO requires a blocking round-trip since the operation may cause
    side-effects in the emulated model or may return data to the caller.
    Therefore, an IO in KVM traps from the guest to the host, causes a VMX/SVM
    "heavy-weight" exit back to userspace, and is ultimately serviced by qemu's
    device model synchronously before returning control back to the vcpu.

    However, there is a subclass of IO which acts purely as a trigger for
    other IO (such as to kick off an out-of-band DMA request, etc). For these
    patterns, the synchronous call is particularly expensive since we really
    only want to simply get our notification transmitted asynchronously and
    return as quickly as possible. All the synchronous infrastructure to ensure
    proper data-dependencies are met in the normal IO case is just unnecessary
    overhead for signalling. This adds additional computational load on the
    system, as well as latency to the signalling path.

    Therefore, we provide a mechanism for registration of an in-kernel trigger
    point that allows the VCPU to only require a very brief, lightweight
    exit just long enough to signal an eventfd. This also means that any
    clients compatible with the eventfd interface (which includes userspace
    and kernelspace equally well) can now register to be notified. The end
    result should be a more flexible and higher performance notification API
    for the backend KVM hypervisor and peripheral components.

    To test this theory, we built a test-harness called "doorbell". This
    module has a function called "doorbell_ring()" which simply increments a
    counter for each time the doorbell is signaled. It supports signalling
    from either an eventfd, or an ioctl().

    We then wired up two paths to the doorbell: One via QEMU via a registered
    io region and through the doorbell ioctl(). The other is direct via
    ioeventfd.

    You can download this test harness here:

    ftp://ftp.novell.com/dev/ghaskins/doorbell.tar.bz2

    The measured results are as follows:

    qemu-mmio: 110000 iops, 9.09us rtt
    ioeventfd-mmio: 200100 iops, 5.00us rtt
    ioeventfd-pio: 367300 iops, 2.72us rtt

    I didn't measure qemu-pio, because I have to figure out how to register a
    PIO region with qemu's device model, and I got lazy. However, for now we
    can extrapolate based on the data from the NULLIO runs (+2.56us for MMIO
    and -350ns for HC), which gives:

    qemu-pio: 153139 iops, 6.53us rtt
    ioeventfd-hc: 412585 iops, 2.37us rtt

    These are just for fun, for now, until I can gather more data.

    Here is a graph for your convenience:

    http://developer.novell.com/wiki/images/7/76/Iofd-chart.png

    The conclusion to draw is that we save about 4us by skipping the userspace
    hop.

    --------------------

    Signed-off-by: Gregory Haskins
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Avi Kivity

    Gregory Haskins
     
  • When kvm is in hpet_legacy_mode, the hpet is providing the timer
    interrupt and the pit should not be. So in legacy mode, the pit timer
    is destroyed, but the *state* of the pit is maintained. So if kvm or
    the guest tries to modify the state of the pit, this modification is
    accepted, *except* that the timer isn't actually started. When we exit
    hpet_legacy_mode, the current state of the pit (which is up to date
    since we've been accepting modifications) is used to restart the pit
    timer.

    The saved_mode code in kvm_pit_load_count temporarily changes mode to
    0xff in order to destroy the timer, but then restores the actual
    value, again maintaining "current" state of the pit for possible later
    reenablement.

    [avi: add some reserved storage in the ioctl; make SET_PIT2 IOW]
    [marcelo: fix memory corruption due to reserved storage]

    Signed-off-by: Beth Kon
    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Beth Kon
     
  • Return EOPNOTSUPP for KVM_TRACE_ENABLE/PAUSE/DISABLE ioctls.

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Avi Kivity

    Marcelo Tosatti
     
  • Instead of mindlessly retrying to execute the instruction, report the
    failure to userspace.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Archs are free to use vcpu_id as they see fit. For x86 it is used as
    the vcpu's apic id. A new ioctl is added to configure the boot vcpu id,
    which was assumed to be 0 until now.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Avi Kivity

    Gleb Natapov
     
  • Somehow the VM ioctls got unsorted; resort.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • We only trap one page for MSI-X entries now, so it's 4k/(128/8) = 256 entries at
    most.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • The in-kernel speaker emulation is only a dummy and also unneeded from
    the performance point of view. Rather, it takes user space support to
    generate sound output on the host, e.g. console beeps.

    To allow this, introduce KVM_CREATE_PIT2 which controls in-kernel
    speaker port emulation via a flag passed along the new IOCTL. It also
    leaves room for future extensions of the PIT configuration interface.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Avi Kivity

    Jan Kiszka
     
  • KVM provides a complete virtual system environment for guests, including
    support for injecting interrupts modeled after the real exception/interrupt
    facilities present on the native platform (such as the IDT on x86).
    Virtual interrupts can come from a variety of sources (emulated devices,
    pass-through devices, etc) but all must be injected to the guest via
    the KVM infrastructure. This patch adds a new mechanism to inject a specific
    interrupt to a guest using a decoupled eventfd mechanism: Any legal signal
    on the irqfd (using eventfd semantics from either userspace or kernel) will
    translate into an injected interrupt in the guest at the next available
    interrupt window.

    Signed-off-by: Gregory Haskins
    Signed-off-by: Avi Kivity

    Gregory Haskins
     
  • The related MSRs are emulated. MCE capability is exported via
    extension KVM_CAP_MCE and ioctl KVM_X86_GET_MCE_CAP_SUPPORTED. A new
    vcpu ioctl command KVM_X86_SETUP_MCE is used to setup MCE emulation
    such as the mcg_cap. MCE is injected via vcpu ioctl command
    KVM_X86_SET_MCE. Extended machine-check state (MCG_EXT_P) and CMCI are
    not implemented.

    Signed-off-by: Huang Ying
    Signed-off-by: Avi Kivity

    Huang Ying
     

10 Jun, 2009

3 commits

  • Two things needed fixing: 1) g++ does not allow a named structure type
    within an anonymous union and 2) Avoid name clash between two padding
    fields within the same struct by giving them different names as is
    done elsewhere in the header.

    Signed-off-by: Nathan Binkert
    Signed-off-by: Avi Kivity

    nathan binkert
     
  • After discussion with Marcelo, we decided to rework the device assignment
    framework together. The problem with the old code is that the kernel logic
    is unnecessarily complex, so Marcelo suggested splitting it up in a more
    elegant way:

    1. Split host IRQ assignment from guest IRQ assignment, and let userspace
    determine the combination. Also discard the msi2intx parameter; userspace can
    specify KVM_DEV_IRQ_HOST_MSI | KVM_DEV_IRQ_GUEST_INTX in assigned_irq->flags
    to enable MSI to INTx conversion.

    2. Split IRQ assignment from IRQ deassignment. Introduce two new ioctls:
    KVM_ASSIGN_DEV_IRQ and KVM_DEASSIGN_DEV_IRQ.

    This patch also fixes the reversed _IOR vs _IOW in the definition (by
    deprecating the old interface).

    [avi: replace homemade bitcount() by hweight_long()]

    Signed-off-by: Marcelo Tosatti
    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • This patch finally enables MSI-X.

    What we need for MSI-X:
    1. Intercept one page in the MMIO region of the device, so that we can get the
    guest's desired MSI-X table and set up the real one. For now this is done by
    the guest and transferred to the kernel using the ioctls KVM_SET_MSIX_NR and
    KVM_SET_MSIX_ENTRY.

    2. Information about the incoming interrupt. One device can now have more than
    one interrupt, and they are all handled by one workqueue structure, so we need
    to identify them. The gsi_msg_pending_bitmap enabled by the previous patch gets
    this done.

    3. A mapping from host IRQ to guest gsi, as well as from guest gsi to the real
    MSI/MSI-X message address/data. We use the same entry number for the host and
    the guest here, so that it's easy to find the correlated guest gsi.

    What we lack for now:
    1. The PCI spec says nothing can exist alongside the MSI-X table in the same
    page of the MMIO region, except pending bits. The patch ignores pending bits as
    a first step (so they are always 0 - no pending).

    2. The PCI spec allows the MSI-X table to be changed dynamically. That means the
    OS can enable MSI-X, then mask one MSI-X entry, modify it, and unmask it. The
    patch doesn't support this, and Linux doesn't work this way either.

    3. The patch doesn't implement MSI-X mask-all or masking of a single entry. I
    will implement the former in driver/pci/msi.c later. For a single entry,
    userspace should have the responsibility to handle it.

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang