06 Sep, 2014

2 commits

  • commit 350b8bdd689cd2ab2c67c8a86a0be86cfa0751a7 upstream.

    The third parameter of kvm_iommu_put_pages is wrong;
    it should be 'gfn - slot->base_gfn'.

    By making gfn very large, a malicious guest or userspace can cause kvm
    to go down this error path, and subsequently to pass a huge value as
    size. Alternatively, if gfn is small, then pages would be pinned but
    never unpinned, causing a host memory leak and local DoS.

    Passing a reasonable but large value could be the most dangerous case,
    because it would unpin a page that should have stayed pinned, and thus
    allow the device to DMA into arbitrary memory. However, this cannot
    happen because of the conditions that can trigger the error:

    - out of memory (where you cannot allocate even a single page), which
    should not be possible for the attacker to trigger

    - when exceeding the iommu's address space, guest pages after gfn
    will also exceed the iommu's address space, and inside
    kvm_iommu_put_pages() the iommu_iova_to_phys() will fail. The
    page thus would not be unpinned at all.
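    A minimal userspace sketch of the arithmetic involved; the types and
    names below are simplified stand-ins for the kernel's, not its actual
    code:

    ```c
    #include <assert.h>

    /* Simplified model of the error path: on failure, only the pages
     * mapped so far should be unpinned, i.e. 'gfn - slot->base_gfn',
     * not the absolute 'gfn'. Names are illustrative stand-ins. */
    typedef unsigned long gfn_t;

    struct memslot { gfn_t base_gfn; };

    static gfn_t unpin_count_buggy(const struct memslot *s, gfn_t gfn)
    {
        (void)s;
        return gfn;               /* huge when gfn is large; leaks when small */
    }

    static gfn_t unpin_count_fixed(const struct memslot *s, gfn_t gfn)
    {
        return gfn - s->base_gfn; /* exactly the pages pinned so far */
    }

    int main(void)
    {
        struct memslot s = { .base_gfn = 0x100000 };
        gfn_t failing_gfn = 0x100040; /* failure 0x40 pages into the slot */

        assert(unpin_count_fixed(&s, failing_gfn) == 0x40);
        assert(unpin_count_buggy(&s, failing_gfn) == 0x100040);
        return 0;
    }
    ```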

    Reported-by: Jack Morgenstein
    Signed-off-by: Michael S. Tsirkin
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Michael S. Tsirkin
     
  • commit 0f6c0a740b7d3e1f3697395922d674000f83d060 upstream.

    Currently, the EOI exit bitmap (used for APICv) does not include
    interrupts that are masked. However, this can cause a bug that manifests
    as an interrupt storm inside the guest. Alex Williamson reported the
    bug and is the one who really debugged this; I only wrote the patch. :)

    The scenario involves a multi-function PCI device with OHCI and EHCI
    USB functions and an audio function, all assigned to the guest, where
    both USB functions use legacy INTx interrupts.

    As soon as the guest boots, interrupts for these devices turn into an
    interrupt storm in the guest; the host does not see the interrupt storm.
    Basically the EOI path does not work, and the guest continues to see the
    interrupt over and over, even after it attempts to mask it at the APIC.
    The bug is only visible with older kernels (RHEL6.5, based on 2.6.32
    with not many changes in the area of APIC/IOAPIC handling).

    Alex then tried forcing bit 59 (corresponding to the USB functions' IRQ)
    on in the eoi_exit_bitmap and TMR, and things then work. What happens
    is that VFIO asserts IRQ11, then KVM recomputes the EOI exit bitmap.
    It does not have bit 59 set because the RTE was masked, so the IOAPIC
    never sees the EOI and the interrupt continues to fire in the guest.

    My guess was that the guest is masking the interrupt in the redirection
    table in the interrupt routine, i.e. while the interrupt is set in a
    LAPIC's ISR. The simplest fix is to ignore the masking state; we would
    rather have an unnecessary exit than a missed IRQ ACK, and anyway
    IOAPIC interrupts are not as performance-sensitive as, for example,
    MSIs. Alex tested this patch and it fixed his bug.
    Alex tested this patch and it fixed his bug.

    [Thanks to Alex for his precise description of the problem
    and initial debugging effort. A lot of the text above is
    based on emails exchanged with him.]
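    The behavioral change can be sketched with a toy model of the
    bitmap computation; the field layout and function names below are
    illustrative, not the kernel's actual code:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Toy model of EOI-exit-bitmap computation. Bit 16 of an IOAPIC
     * redirection entry is the mask bit; the low byte is the vector. */
    #define RTE_MASKED    (1u << 16)
    #define RTE_VECTOR(e) ((e) & 0xffu)

    static void compute_eoi_exit_bitmap(const uint32_t *rte, int n,
                                        uint64_t bitmap[4], int ignore_mask)
    {
        for (int i = 0; i < n; i++) {
            if (!ignore_mask && (rte[i] & RTE_MASKED))
                continue;         /* old behaviour: skip masked entries */
            unsigned vec = RTE_VECTOR(rte[i]);
            bitmap[vec / 64] |= 1ull << (vec % 64);
        }
    }

    int main(void)
    {
        uint32_t rte[1] = { RTE_MASKED | 59 };   /* masked entry, vector 59 */
        uint64_t before[4] = {0}, after[4] = {0};

        compute_eoi_exit_bitmap(rte, 1, before, 0);
        compute_eoi_exit_bitmap(rte, 1, after, 1);

        assert(!(before[0] & (1ull << 59)));     /* bug: EOI never trapped */
        assert(after[0] & (1ull << 59));         /* fix: set despite the mask */
        return 0;
    }
    ```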

    Reported-by: Alex Williamson
    Tested-by: Alex Williamson
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     

13 May, 2014

3 commits

  • commit 5678de3f15010b9022ee45673f33bcfc71d47b60 upstream.

    QE reported that they got the BUG_ON in ioapic_service to trigger.
    I cannot reproduce it, but there are two reasons why this could happen.

    The less likely but also easiest one is when kvm_irq_delivery_to_apic
    does not deliver to any APIC and returns -1.

    Because irqe.shorthand == 0, the kvm_for_each_vcpu loop in that
    function is never reached. However, you can target the similar loop in
    kvm_irq_delivery_to_apic_fast; just program a zero logical destination
    address into the IOAPIC, or an out-of-range physical destination address.
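    A toy model of why a zero logical destination yields -1; the matching
    logic and names are illustrative, not the kernel's exact code:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Logical-mode delivery matches vcpus whose logical destination
     * register (LDR) intersects the destination byte. A destination of
     * zero intersects nothing, so delivery returns -1, and the caller
     * must tolerate that instead of hitting a BUG_ON(). */
    static int deliver_logical(const uint8_t *ldr, int nvcpus, uint8_t dest)
    {
        int delivered = -1;
        for (int i = 0; i < nvcpus; i++)
            if (ldr[i] & dest)
                delivered = (delivered < 0 ? 0 : delivered) + 1;
        return delivered;  /* -1 when no APIC was targeted */
    }

    int main(void)
    {
        uint8_t ldr[2] = { 0x01, 0x02 };
        assert(deliver_logical(ldr, 2, 0x00) == -1); /* zero dest: nobody */
        assert(deliver_logical(ldr, 2, 0x03) == 2);  /* both vcpus match */
        return 0;
    }
    ```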

    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Paolo Bonzini
     
  • commit 41c22f626254b9dc0376928cae009e73d1b6a49a upstream.

    get_user_pages(mm) is simply wrong if mm->mm_users == 0 and exit_mmap/etc
    was already called (or is in progress); mm->mm_count can only pin mm->pgd
    and mm_struct itself.

    Change kvm_setup_async_pf/async_pf_execute to inc/dec mm->mm_users.

    kvm_create_vm/kvm_destroy_vm play with ->mm_count too but this case looks
    fine at first glance, it seems that this ->mm is only used to verify that
    current->mm == kvm->mm.
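    The distinction between the two refcounts can be sketched with a toy
    model (fields and helper are illustrative, not the kernel's code):

    ```c
    #include <assert.h>

    /* mm_users pins the address space (page tables, VMAs); mm_count
     * only pins the mm_struct and pgd. GUP walks the address space,
     * so async_pf must hold an mm_users reference. */
    struct mm {
        int mm_users;   /* address space alive while > 0 */
        int mm_count;   /* struct itself alive while > 0 */
    };

    static int gup_is_safe(const struct mm *mm)
    {
        return mm->mm_users > 0;   /* exit_mmap() runs when this hits 0 */
    }

    int main(void)
    {
        struct mm mm = { .mm_users = 1, .mm_count = 2 };

        mm.mm_users--;             /* task exits: exit_mmap() tears down */
        assert(!gup_is_safe(&mm)); /* holding only mm_count: GUP is wrong */

        mm.mm_users = 1;           /* fix: async_pf takes an mm_users ref */
        assert(gup_is_safe(&mm));
        return 0;
    }
    ```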

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Greg Kroah-Hartman

    Oleg Nesterov
     
  • commit 91021a6c8ffdc55804dab5acdfc7de4f278b9ac3 upstream.

    When dispatching an SGI (mode == 0), the VCPU of the VM should send
    the SGI to the CPUs in the target_cpus list.
    So a "break" must be added to the "case 0" branch.
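    A sketch of the dispatch switch; without the "break" in "case 0",
    control falls through and clobbers target_cpus. The structure below
    is illustrative, not the exact kernel code:

    ```c
    #include <assert.h>

    /* Model of the SGI dispatch mode switch. */
    static unsigned long compute_targets(int mode, unsigned long reg_targets,
                                         unsigned long all_but_self)
    {
        unsigned long target_cpus = 0;

        switch (mode) {
        case 0:
            target_cpus = reg_targets;   /* send to the listed CPUs */
            break;                       /* the missing break */
        case 1:
            target_cpus = all_but_self;  /* all CPUs except the sender */
            break;
        case 2:
            target_cpus = 0;             /* would be: only the sender */
            break;
        }
        return target_cpus;
    }

    int main(void)
    {
        /* With the break, mode 0 keeps the register's target list
         * instead of falling through to the next case. */
        assert(compute_targets(0, 0x5, 0xe) == 0x5);
        return 0;
    }
    ```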

    Signed-off-by: Haibin Wang
    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall
    Signed-off-by: Greg Kroah-Hartman

    Haibin Wang
     

14 Feb, 2014

1 commit

  • This fixes the build breakage introduced by
    c07a0191ef2de1f9510f12d1f88e3b0b5cd8d66f and adds support for the device
    control API and save/restore of the VGIC state for ARMv8.

    The defines were simply missing from the arm64 header files and
    uaccess.h must be implicitly imported from somewhere else on arm.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Paolo Bonzini

    Christoffer Dall
     

30 Jan, 2014

1 commit


23 Jan, 2014

1 commit

  • Pull KVM updates from Paolo Bonzini:
    "First round of KVM updates for 3.14; PPC parts will come next week.

    Nothing major here, just bugfixes all over the place. The most
    interesting part is the ARM guys' virtualized interrupt controller
    overhaul, which lets userspace get/set the state and thus enables
    migration of ARM VMs"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (67 commits)
    kvm: make KVM_MMU_AUDIT help text more readable
    KVM: s390: Fix memory access error detection
    KVM: nVMX: Update guest activity state field on L2 exits
    KVM: nVMX: Fix nested_run_pending on activity state HLT
    KVM: nVMX: Clean up handling of VMX-related MSRs
    KVM: nVMX: Add tracepoints for nested_vmexit and nested_vmexit_inject
    KVM: nVMX: Pass vmexit parameters to nested_vmx_vmexit
    KVM: nVMX: Leave VMX mode on clearing of feature control MSR
    KVM: VMX: Fix DR6 update on #DB exception
    KVM: SVM: Fix reading of DR6
    KVM: x86: Sync DR7 on KVM_SET_DEBUGREGS
    add support for Hyper-V reference time counter
    KVM: remove useless write to vcpu->hv_clock.tsc_timestamp
    KVM: x86: fix tsc catchup issue with tsc scaling
    KVM: x86: limit PIT timer frequency
    KVM: x86: handle invalid root_hpa everywhere
    kvm: Provide kvm_vcpu_eligible_for_directed_yield() stub
    kvm: vfio: silence GCC warning
    KVM: ARM: Remove duplicate include
    arm/arm64: KVM: relax the requirements of VMA alignment for THP
    ...

    Linus Torvalds
     

15 Jan, 2014

2 commits

  • Commit 7940876e1330671708186ac3386aa521ffb5c182 ("kvm: make local
    functions static") broke KVM PPC builds due to removing (rather than
    moving) the stub version of kvm_vcpu_eligible_for_directed_yield().

    This patch reintroduces it.
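    A self-contained sketch of the reintroduced helper, with the #ifdef
    inside the function body; the struct is reduced to a stub here so
    the example stands alone:

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Stub of the relevant vcpu fields, for illustration only. */
    struct kvm_vcpu { bool in_spin_loop; bool dy_eligible; };

    static bool kvm_vcpu_eligible_for_directed_yield(struct kvm_vcpu *vcpu)
    {
    #ifdef CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT
        return !vcpu->in_spin_loop || vcpu->dy_eligible;
    #else
        (void)vcpu;
        return true;   /* stub for archs without relax intercept (e.g. PPC) */
    #endif
    }

    int main(void)
    {
        struct kvm_vcpu v = { .in_spin_loop = true, .dy_eligible = false };
        /* CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT is not defined here,
         * so the stub path is taken. */
        assert(kvm_vcpu_eligible_for_directed_yield(&v) == true);
        return 0;
    }
    ```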

    Signed-off-by: Scott Wood
    Cc: Stephen Hemminger
    Cc: Alexander Graf
    [Move the #ifdef inside the function. - Paolo]
    Signed-off-by: Paolo Bonzini

    Scott Wood
     
  • Building vfio.o triggers a GCC warning (when building for 32 bits x86):
    arch/x86/kvm/../../../virt/kvm/vfio.c: In function 'kvm_vfio_set_group':
    arch/x86/kvm/../../../virt/kvm/vfio.c:104:22: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
    void __user *argp = (void __user *)arg;
    ^

    Silence this warning by casting arg to unsigned long.

    argp's current type, "void __user *", is always cast to "int32_t
    __user *". So its type might as well be changed to "int32_t __user *".
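    A minimal illustration of the fix: casting the integer argument
    through unsigned long (the pointer-sized integer type) avoids the
    truncating int-to-pointer cast on 32-bit builds. Userspace sketch,
    not the kernel code:

    ```c
    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint64_t arg = 0x1234;                 /* the ioctl 'arg' value */

        /* fix: in the kernel, (int32_t __user *)(unsigned long)arg */
        int32_t *argp = (int32_t *)(unsigned long)arg;

        assert((unsigned long)argp == 0x1234); /* value preserved */
        return 0;
    }
    ```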

    Signed-off-by: Paul Bolle
    Signed-off-by: Paolo Bonzini

    Paul Bolle
     

09 Jan, 2014

2 commits


22 Dec, 2013

10 commits

  • Implement support for the CPU interface register access driven by MMIO
    address offsets from the CPU interface base address. Useful for user
    space to support save/restore of the VGIC state.

    This commit adds support only for the same logic as the current VGIC
    support, and no more. For example, the active priority registers are
    handled as RAZ/WI, just like setting priorities on the emulated
    distributor.

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Handle MMIO accesses to the two registers which should support both the
    case where the VMs want to read/write either of these registers and the
    case where user space reads/writes these registers to do save/restore of
    the VGIC state.

    Note that the added complexity compared to simple set/clear enable
    registers stems from the bookkeeping of source cpu ids. It may be
    possible to change the underlying data structure to simplify the
    complexity, but since this is not in the critical path at all, this will
    do.

    Also note that reading this register from a live guest will not be
    accurate compared to on hardware, because some state may be living on
    the CPU LRs and the only way to give a consistent read would be to force
    stop all the VCPUs and request them to unqueue the LR state onto the
    distributor. Until we have an actual user of live reading this
    register, we can live with the difference.

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
    To properly access the VGIC state from user space it is very impractical
    to have to loop through all the LRs in all register access functions.
    Instead, support moving all pending state from LRs to the distributor,
    but leave active state LRs alone.

    Note that to accurately present the active and pending state to VCPUs
    reading these distributor registers from a live VM, we would have to
    stop all VCPUs other than the calling VCPU and ask each CPU to unqueue
    their LR state onto the distributor and add fields to track active state
    on the distributor side as well. We don't have any users of such
    functionality yet and there are other inaccuracies of the GIC emulation,
    so don't provide accurate synchronized access to this state just yet.
    However, when the time comes, having this function should help.

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Add infrastructure to handle distributor and cpu interface register
    accesses through the KVM_{GET/SET}_DEVICE_ATTR interface by adding the
    KVM_DEV_ARM_VGIC_GRP_DIST_REGS and KVM_DEV_ARM_VGIC_GRP_CPU_REGS groups
    and defining the semantics of the attr field to be the MMIO offset as
    specified in the GICv2 specs.

    Missing register accesses or other changes in individual register access
    functions to support save/restore of the VGIC state is added in
    subsequent patches.

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Rename the vgic_ranges array to vgic_dist_ranges to be more specific and
    to prepare for handling CPU interface register access as well (for
    save/restore of VGIC state).

    Pass offset from distributor or interface MMIO base to
    find_matching_range function instead of the physical address of the
    access in the VM memory map. This allows other callers unaware of the
    VM specifics, but with generic VGIC knowledge to reuse the function.
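    A sketch of an offset-based lookup; struct fields and the example
    offsets below are illustrative, not the kernel's exact tables:

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Callers pass the offset from the distributor (or CPU interface)
     * MMIO base, not a physical address in the VM memory map, so code
     * without VM-specific knowledge can reuse the lookup. */
    struct mmio_range { unsigned long base, len; int id; };

    static const struct mmio_range *
    find_matching_range(const struct mmio_range *r, int n, unsigned long offset)
    {
        for (int i = 0; i < n; i++)
            if (offset >= r[i].base && offset < r[i].base + r[i].len)
                return &r[i];
        return NULL;
    }

    int main(void)
    {
        /* e.g. a control register at 0x000, enable registers at 0x100 */
        struct mmio_range dist[] = {
            { 0x000, 0x004, 1 },
            { 0x100, 0x080, 2 },
        };

        assert(find_matching_range(dist, 2, 0x104)->id == 2);
        assert(find_matching_range(dist, 2, 0x900) == NULL);
        return 0;
    }
    ```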

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Support setting the distributor and cpu interface base addresses in the
    VM physical address space through the KVM_{SET,GET}_DEVICE_ATTR API
    in addition to the ARM specific API.

    This has the added benefit of being able to share more code in user
    space and do things in a uniform manner.

    Also deprecate the older API at the same time, but backwards
    compatibility will be maintained.

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Support creating the ARM VGIC device through the KVM_CREATE_DEVICE
    ioctl, which can then later be leveraged to use the
    KVM_{GET/SET}_DEVICE_ATTR, which is useful both for setting addresses in
    a more generic API than the ARM-specific one and is useful for
    save/restore of VGIC state.

    Adds KVM_CAP_DEVICE_CTRL to ARM capabilities.

    Note that we change the check for creating a VGIC from bailing out if
    any VCPUs were created, to bailing out if any VCPUs were ever run. This
    is an important distinction that shouldn't break anything, but allows
    creating the VGIC after the VCPUs have been created.
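    The relaxed check can be modeled as follows; the struct and names
    are illustrative stand-ins for the kernel's:

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Creating the VGIC is refused only once a VCPU has actually run,
     * not merely been created. */
    struct vm { int nr_vcpus; bool vcpu_has_run; };

    static int vgic_create(const struct vm *vm)
    {
        if (vm->vcpu_has_run)
            return -1;    /* too late: a VCPU already executed */
        return 0;         /* VCPUs merely created is now fine */
    }

    int main(void)
    {
        struct vm vm = { .nr_vcpus = 2, .vcpu_has_run = false };
        assert(vgic_create(&vm) == 0);   /* old check would have refused */

        vm.vcpu_has_run = true;
        assert(vgic_create(&vm) == -1);
        return 0;
    }
    ```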

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Rework the VGIC initialization slightly to allow initialization of the
    vgic cpu-specific state even if the irqchip (the VGIC) hasn't been
    created by user space yet. This is safe, because the vgic data
    structures are already allocated when the CPU is allocated if VGIC
    support is compiled into the kernel. Further, the init process does not
    depend on any other information and the sacrifice is a slight
    performance degradation for creating VMs in the no-VGIC case.

    The reason is that the new device control API doesn't mandate creating
    the VGIC before creating the VCPU and it is unreasonable to require user
    space to create the VGIC before creating the VCPUs.

    At the same time move the irqchip_in_kernel check out of
    kvm_vcpu_first_run_init and into the init function to make the per-vcpu
    and global init functions symmetric and add comments on the exported
    functions making it a bit easier to understand the init flow by only
    looking at vgic.c.

    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • For migration to work we need to save (and later restore) the state of
    each core's virtual generic timer.
    Since this is per VCPU, we can use the [gs]et_one_reg ioctl and export
    the three needed registers (control, counter, compare value).
    Though they live in cp15 space, we don't use the existing list, since
    they need special accessor functions and the arch timer is optional.

    Acked-by: Marc Zyngier
    Signed-off-by: Andre Przywara
    Signed-off-by: Christoffer Dall

    Andre Przywara
     
    Initialize the cntvoff at kvm_init_vm time, not before running the VCPUs
    for the first time, because that would overwrite any potentially restored
    values from user space.

    Cc: Andre Przywara
    Acked-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Christoffer Dall
     

13 Dec, 2013

2 commits

  • Since the commit 15ad7146 ("KVM: Use the scheduler preemption notifiers
    to make kvm preemptible"), the remaining stuff in this function is a
    simple cond_resched() call with an extra need_resched() check which was
    there to avoid dropping VCPUs unnecessarily. Now it is meaningless.

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Paolo Bonzini

    Takuya Yoshikawa
     
    In multiple functions the vcpu_id is used as an offset into a bitfield. A
    malicious user could specify a vcpu_id greater than 255 in order to set or
    clear bits in kernel memory. This could be used to elevate privileges in
    the kernel. This patch verifies that the vcpu_id provided is less than 255.
    The api documentation already specifies that the vcpu_id must be less than
    max_vcpus, but this is currently not checked.
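    A toy model of the out-of-bounds write and the bound check; sizes
    and names are illustrative, not the kernel's exact code:

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* vcpu_id indexes into a 256-bit bitmap; without a bound check a
     * large id would set bits outside the array. */
    #define MAX_VCPU_ID 255

    static int set_vcpu_bit(uint64_t bitmap[4], unsigned int vcpu_id)
    {
        if (vcpu_id >= MAX_VCPU_ID)
            return -1;             /* reject out-of-range ids */
        bitmap[vcpu_id / 64] |= 1ull << (vcpu_id % 64);
        return 0;
    }

    int main(void)
    {
        uint64_t bitmap[4] = {0};

        assert(set_vcpu_bit(bitmap, 3) == 0);
        assert(bitmap[0] == 8);                    /* bit 3 set */
        assert(set_vcpu_bit(bitmap, 1000) == -1);  /* would corrupt memory */
        return 0;
    }
    ```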

    Reported-by: Andrew Honig
    Cc: stable@vger.kernel.org
    Signed-off-by: Andrew Honig
    Signed-off-by: Paolo Bonzini

    Andy Honig
     

21 Nov, 2013

1 commit

  • Using the address of 'empty_zero_page' as source address in order to
    clear a page is wrong. On some architectures empty_zero_page is only the
    pointer to the struct page of the empty_zero_page. Therefore the clear
    page operation would copy the contents of a couple of struct pages instead
    of clearing a page. For kvm only arm/arm64 are affected by this bug.

    To fix this use the ZERO_PAGE macro instead which will return the struct
    page address of the empty_zero_page on all architectures.
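    A userspace model of the difference; the struct layout and macros
    below are loose stand-ins for the kernel's, purely for illustration:

    ```c
    #include <assert.h>
    #include <string.h>

    /* On some architectures 'empty_zero_page' is only a pointer to the
     * struct page, so copying from its address copies struct-page
     * contents, not zeroes. ZERO_PAGE() always yields the struct page,
     * from which the data address is derived. */
    struct page { unsigned long flags; void *virtual; };

    static char zero_data[4096];                  /* the real zeroed page */
    static struct page zero_struct = { 0xdead, zero_data };
    static struct page *empty_zero_page = &zero_struct;  /* arm-style */

    #define ZERO_PAGE()     (&zero_struct)
    #define page_address(p) ((p)->virtual)

    int main(void)
    {
        char dst[16];

        /* buggy pattern: treats &empty_zero_page as page data */
        memcpy(dst, &empty_zero_page, sizeof(void *));

        /* fixed pattern: go through the struct page */
        memcpy(dst, page_address(ZERO_PAGE()), sizeof(dst));
        assert(dst[0] == 0 && dst[15] == 0);      /* really zeroes now */
        return 0;
    }
    ```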

    Signed-off-by: Heiko Carstens
    Signed-off-by: Gleb Natapov

    Heiko Carstens
     

15 Nov, 2013

1 commit

  • Pull KVM changes from Paolo Bonzini:
    "Here are the 3.13 KVM changes. There was a lot of work on the PPC
    side: the fact that the HV and emulation flavors can now coexist in a
    single kernel is probably the most interesting change from a user
    point of view.

    On the x86 side there are nested virtualization improvements and a few
    bugfixes.

    ARM got transparent huge page support, improved overcommit, and
    support for big endian guests.

    Finally, there is a new interface to connect KVM with VFIO. This
    helps with devices that use NoSnoop PCI transactions, letting the
    driver in the guest execute WBINVD instructions. This includes some
    nVidia cards on Windows, that fail to start without these patches and
    the corresponding userspace changes"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
    kvm, vmx: Fix lazy FPU on nested guest
    arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
    arm/arm64: KVM: MMIO support for BE guest
    kvm, cpuid: Fix sparse warning
    kvm: Delete prototype for non-existent function kvm_check_iopl
    kvm: Delete prototype for non-existent function complete_pio
    hung_task: add method to reset detector
    pvclock: detect watchdog reset at pvclock read
    kvm: optimize out smp_mb after srcu_read_unlock
    srcu: API for barrier after srcu read unlock
    KVM: remove vm mmap method
    KVM: IOMMU: hva align mapping page size
    KVM: x86: trace cpuid emulation when called from emulator
    KVM: emulator: cleanup decode_register_operand() a bit
    KVM: emulator: check rex prefix inside decode_register()
    KVM: x86: fix emulation of "movzbl %bpl, %eax"
    kvm_host: typo fix
    KVM: x86: emulate SAHF instruction
    MAINTAINERS: add tree for kvm.git
    Documentation/kvm: add a 00-INDEX file
    ...

    Linus Torvalds
     

06 Nov, 2013

1 commit

    It was used in conjunction with the KVM_SET_MEMORY_REGION ioctl, which
    was removed by b74a07beed0 in 2010; QEMU stopped using it in 2008, so
    it is finally time to remove the code.

    Signed-off-by: Gleb Natapov

    Gleb Natapov
     

05 Nov, 2013

1 commit

  • When determining the page size we could use to map with the IOMMU, the
    page size should also be aligned with the hva, not just the gfn. The
    gfn may not reflect the real alignment within the hugetlbfs file.

    Most of the time, this works fine. However, if the hugetlbfs file is
    backed by non-contiguous huge pages, a multi-huge page memslot starts at
    an unaligned offset within the hugetlbfs file, and the gfn is aligned
    with respect to the huge page size, kvm_host_page_size() will return the
    huge page size and we will use that to map with the IOMMU.

    When we later unpin that same memslot, the IOMMU returns the unmap size
    as the huge page size, and we happily unpin that many pfns in
    monotonically increasing order, not realizing we are spanning
    non-contiguous huge pages and partially unpin the wrong huge page.

    Ensure the IOMMU mapping page size is aligned with the hva corresponding
    to the gfn, which does reflect the alignment within the hugetlbfs file.
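    The alignment rule can be sketched as shrinking the candidate page
    size until both the gfn offset and the hva are aligned; constants
    and the loop below are illustrative, not the kernel's exact code:

    ```c
    #include <assert.h>

    #define HPAGE_SIZE (2ul << 20)   /* 2 MiB huge page */
    #define PAGE_SHIFT 12

    /* Shrink the mapping size until it divides both the gfn offset and
     * the hva; checking only the gfn can over-estimate alignment when
     * the hugetlbfs backing is non-contiguous. */
    static unsigned long iommu_page_size(unsigned long gfn, unsigned long hva,
                                         unsigned long host_size)
    {
        unsigned long size = host_size;

        while (size > (1ul << PAGE_SHIFT) &&
               (((gfn << PAGE_SHIFT) | hva) & (size - 1)))
            size >>= 1;
        return size;
    }

    int main(void)
    {
        /* gfn aligned to the huge page, but hva is not: fall back to 4K */
        assert(iommu_page_size(0x200, 0x1000, HPAGE_SIZE) == 4096);

        /* both aligned: the huge page size is safe */
        assert(iommu_page_size(0x200, 0x200000, HPAGE_SIZE) == HPAGE_SIZE);
        return 0;
    }
    ```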

    Reviewed-by: Marcelo Tosatti
    Signed-off-by: Greg Edwards
    Cc: stable@vger.kernel.org
    Signed-off-by: Gleb Natapov

    Greg Edwards
     

04 Nov, 2013

1 commit


31 Oct, 2013

3 commits

  • We currently use some ad-hoc arch variables tied to legacy KVM device
    assignment to manage emulation of instructions that depend on whether
    non-coherent DMA is present. Create an interface for this, adapting
    legacy KVM device assignment and adding VFIO via the KVM-VFIO device.
    For now we assume that non-coherent DMA is possible any time we have a
    VFIO group. Eventually an interface can be developed as part of the
    VFIO external user interface to query the coherency of a group.
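    The interface can be sketched as a per-VM counter; the function
    names follow the commit, but the bodies below are a simplified
    userspace model, not the kernel implementation:

    ```c
    #include <assert.h>
    #include <stdbool.h>

    struct kvm { int noncoherent_dma_count; };

    static void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
    {
        kvm->noncoherent_dma_count++;    /* e.g. a VFIO group was added */
    }

    static void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
    {
        kvm->noncoherent_dma_count--;
    }

    static bool kvm_arch_has_noncoherent_dma(struct kvm *kvm)
    {
        /* gates emulation of instructions like WBINVD */
        return kvm->noncoherent_dma_count > 0;
    }

    int main(void)
    {
        struct kvm kvm = { 0 };

        assert(!kvm_arch_has_noncoherent_dma(&kvm));
        kvm_arch_register_noncoherent_dma(&kvm);
        assert(kvm_arch_has_noncoherent_dma(&kvm));
        kvm_arch_unregister_noncoherent_dma(&kvm);
        assert(!kvm_arch_has_noncoherent_dma(&kvm));
        return 0;
    }
    ```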

    Signed-off-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Alex Williamson
     
  • Default to operating in coherent mode. This simplifies the logic when
    we switch to a model of registering and unregistering noncoherent I/O
    with KVM.

    Signed-off-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Alex Williamson
     
  • So far we've succeeded at making KVM and VFIO mostly unaware of each
    other, but areas are cropping up where a connection beyond eventfds
    and irqfds needs to be made. This patch introduces a KVM-VFIO device
    that is meant to be a gateway for such interaction. The user creates
    the device and can add and remove VFIO groups to it via file
    descriptors. When a group is added, KVM verifies the group is valid
    and gets a reference to it via the VFIO external user interface.

    Signed-off-by: Alex Williamson
    Signed-off-by: Paolo Bonzini

    Alex Williamson
     

30 Oct, 2013

1 commit


28 Oct, 2013

1 commit

    In kvm_iommu_map_pages(), we need to know the page size via a call to
    kvm_host_page_size(), which checks whether the target slot is valid
    before returning the right page size.
    Currently, we map the iommu pages when creating a new slot, but we
    call kvm_iommu_map_pages() while preparing the new slot. At that
    time, the new slot is not yet visible to the domain (it is still
    being prepared), so we cannot get the right page size from
    kvm_host_page_size() and this breaks the IOMMU super page logic.
    The solution is to map the iommu pages after we insert the new slot
    into the domain.
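    The ordering issue can be modeled with a visibility flag; the struct
    and fallback size below are illustrative, not the kernel's code:

    ```c
    #include <assert.h>

    #define PAGE_SIZE  4096ul
    #define HPAGE_SIZE (2ul << 20)

    struct slot { int installed; };

    /* kvm_host_page_size() can only report the real (huge) page size
     * once the slot is visible; before that it falls back to 4K, which
     * defeats the IOMMU super page logic. */
    static unsigned long host_page_size(const struct slot *s)
    {
        return s->installed ? HPAGE_SIZE : PAGE_SIZE;
    }

    int main(void)
    {
        struct slot s = { .installed = 0 };

        /* old order: map while preparing, super pages lost */
        assert(host_page_size(&s) == PAGE_SIZE);

        /* fixed order: install the slot first, then map */
        s.installed = 1;
        assert(host_page_size(&s) == HPAGE_SIZE);
        return 0;
    }
    ```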

    Signed-off-by: Yang Zhang
    Tested-by: Patrick Lu
    Signed-off-by: Paolo Bonzini

    Yang Zhang
     

17 Oct, 2013

3 commits


15 Oct, 2013

1 commit

    Page pinning is not mandatory in kvm async page fault processing,
    since after the async page fault event is delivered to a guest, the
    guest accesses the page once again and does its own GUP. Drop the
    FOLL_GET flag from the GUP in the async_pf code, and simplify the
    check/clear processing.

    Suggested-by: Gleb Natapov
    Signed-off-by: Gu zheng
    Signed-off-by: chai wen
    Signed-off-by: Gleb Natapov

    chai wen
     

03 Oct, 2013

2 commits

  • When KVM (de)assigns PCI(e) devices to VMs, a debug message is printed
    including the BDF notation of the respective device. Currently, the BDF
    notation does not have the commonly used leading zeros. This produces
    messages like "assign device 0:1:8.0", which look strange at first sight.

    The patch fixes this by replacing the printk(KERN_DEBUG ...) with
    dev_info() and also inserts "kvm" into the debug message, so that it is
    obvious where the message comes from. It also reduces LoC.

    Acked-by: Alex Williamson
    Signed-off-by: Andre Richter
    Signed-off-by: Gleb Natapov

    Andre Richter
     
  • gfn_to_memslot() can return NULL or invalid slot. We need to check slot
    validity before accessing it.

    Reviewed-by: Paolo Bonzini
    Signed-off-by: Gleb Natapov

    Gleb Natapov