15 Jul, 2016

4 commits

  • Once anon_inode_getfd() has succeeded, it's impossible to undo
    in a clean way and no, sys_close() is not usable in such cases.
    Use anon_inode_getfile() and get_unused_fd_flags() to get struct file
    and descriptor and do *not* install the file into the descriptor table
    until after the last possible failure exit.

    Signed-off-by: Paolo Bonzini

    Al Viro
     
  • This reverts commit 77ecc085fed1af1000ca719522977b960aa6da52.

    Al Viro colorfully says: "You should *NEVER* use sys_close() on failure
    exit paths like that. Moreover, this kvm_put_kvm() becomes a double-put,
    since closing the damn file will drop that reference to kvm. Please,
    revert. anon_inode_getfd() should be used only when there's no possible
    failures past its call".

    Signed-off-by: Paolo Bonzini

    Paolo Bonzini
     
  • The failure of create debugfs of VM will return directly without release
    the anon file. It will leak memory and file descriptors, even through
    be not serious.

    Signed-off-by: Liu Shuo
    Fixes: 536a6f88c49dd739961ffd53774775afed852c83
    Signed-off-by: Paolo Bonzini

    Liu Shuo
     
  • When freeing the nested resources of a vcpu, there is an assumption that
    the vcpu's vmcs01 is the current VMCS on the CPU that executes
    nested_release_vmcs12(). If this assumption is violated, the vcpu's
    vmcs01 may be made active on multiple CPUs at the same time, in
    violation of Intel's specification. Moreover, since the vcpu's vmcs01 is
    not VMCLEARed on every CPU on which it is active, it can linger in a
    CPU's VMCS cache after it has been freed and potentially
    repurposed. Subsequent eviction from the CPU's VMCS cache on a capacity
    miss can result in memory corruption.

    It is not sufficient for vmx_free_vcpu() to call vmx_load_vmcs01(). If
    the vcpu in question was last loaded on a different CPU, it must be
    migrated to the current CPU before calling vmx_load_vmcs01().

    Signed-off-by: Jim Mattson
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini

    Jim Mattson
     

16 Jun, 2016

1 commit

  • These days, we experienced one guest crash with 8 cores and 3 disks,
    with qemu error logs as bellow:

    qemu-system-x86_64: /build/qemu-2.0.0/kvm-all.c:984:
    kvm_irqchip_commit_routes: Assertion `ret == 0' failed.

    And then we found one patch(bdf026317d) in qemu tree, which said
    could fix this bug.

    Execute the following script will reproduce the BUG quickly:

    irq_affinity.sh
    ========================================================================

    vda_irq_num=25
    vdb_irq_num=27
    while [ 1 ]
    do
    for irq in {1,2,4,8,10,20,40,80}
    do
    echo $irq > /proc/irq/$vda_irq_num/smp_affinity
    echo $irq > /proc/irq/$vdb_irq_num/smp_affinity
    dd if=/dev/vda of=/dev/zero bs=4K count=100 iflag=direct
    dd if=/dev/vdb of=/dev/zero bs=4K count=100 iflag=direct
    done
    done
    ========================================================================

    The following qemu log is added in the qemu code and is displayed when
    this bug reproduced:

    kvm_irqchip_commit_routes: max gsi: 1008, nr_allocated_irq_routes: 1024,
    irq_routes->nr: 1024, gsi_count: 1024.

    That's to say when irq_routes->nr == 1024, there are 1024 routing entries,
    but in the kernel code when routes->nr >= 1024, will just return -EINVAL;

    The nr is the number of the routing entries which is in of
    [1 ~ KVM_MAX_IRQ_ROUTES], not the index in [0 ~ KVM_MAX_IRQ_ROUTES - 1].

    This patch fix the BUG above.

    Cc: stable@vger.kernel.org
    Signed-off-by: Xiubo Li
    Signed-off-by: Wei Tang
    Signed-off-by: Zhang Zhuoyu
    Signed-off-by: Paolo Bonzini

    Xiubo Li
     

02 Jun, 2016

3 commits

  • This causes an ugly dmesg splat. Beautified syzkaller testcase:

    #include
    #include
    #include
    #include
    #include

    long r[8];

    int main()
    {
    struct kvm_irq_routing ir = { 0 };
    r[2] = open("/dev/kvm", O_RDWR);
    r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
    r[4] = ioctl(r[3], KVM_SET_GSI_ROUTING, &ir);
    return 0;
    }

    Reported-by: Dmitry Vyukov
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář

    Paolo Bonzini
     
  • Found by syzkaller:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000120
    IP: [] kvm_irq_map_gsi+0x12/0x90 [kvm]
    PGD 6f80b067 PUD b6535067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 3 PID: 4988 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1
    [...]
    Call Trace:
    [] irqfd_update+0x32/0xc0 [kvm]
    [] kvm_irqfd+0x3dc/0x5b0 [kvm]
    [] kvm_vm_ioctl+0x164/0x6f0 [kvm]
    [] do_vfs_ioctl+0x298/0x480
    [] SyS_ioctl+0x79/0x90
    [] tracesys_phase2+0x84/0x89
    Code: b5 71 a7 e0 5b 41 5c 41 5d 5d f3 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 8f 10 2e 00 00 31 c0 48 89 e5 91 20 01 00 00 76 6a 48 63 d2 48 8b 94 d1 28 01 00 00 48 85
    RIP [] kvm_irq_map_gsi+0x12/0x90 [kvm]
    RSP
    CR2: 0000000000000120

    Testcase:

    #include
    #include
    #include
    #include
    #include
    #include
    #include

    long r[26];

    int main()
    {
    memset(r, -1, sizeof(r));
    r[2] = open("/dev/kvm", 0);
    r[3] = ioctl(r[2], KVM_CREATE_VM, 0);

    struct kvm_irqfd ifd;
    ifd.fd = syscall(SYS_eventfd2, 5, 0);
    ifd.gsi = 3;
    ifd.flags = 2;
    ifd.resamplefd = ifd.fd;
    r[25] = ioctl(r[3], KVM_IRQFD, &ifd);
    return 0;
    }

    Reported-by: Dmitry Vyukov
    Cc: stable@vger.kernel.org
    Signed-off-by: Paolo Bonzini
    Signed-off-by: Radim Krčmář

    Paolo Bonzini
     
  • When changing the active bit from an MMIO trap, we decide to
    explode if the intid is that of a private interrupt.

    This flawed logic comes from the fact that we were assuming that
    kvm_vcpu_kick() as called by kvm_arm_halt_vcpu() would not return before
    the called vcpu responded, but this is not the case, so we need to
    perform this wait even for private interrupts.

    Dropping the BUG_ON seems like the right thing to do.

    [ Commit message tweaked by Christoffer ]

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     

31 May, 2016

3 commits

  • When reading back from the list registers, we need to perform
    two actions for level interrupts:
    1) clear the soft-pending bit if the interrupt is not pending
    anymore *in the list register*
    2) resample the line level and propagate it to the pending state

    But these two actions shouldn't be linked, and we should *always*
    resample the line level, no matter what state is in the list
    register. Otherwise, we may end-up injecting spurious interrupts
    that have been already retired.

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • When reading back from the list registers, we need to perform
    two actions for level interrupts:
    1) clear the soft-pending bit if the interrupt is not pending
    anymore *in the list register*
    2) resample the line level and propagate it to the pending state

    But these two actions shouldn't be linked, and we should *always*
    resample the line level, no matter what state is in the list
    register. Otherwise, we may end-up injecting spurious interrupts
    that have been already retired.

    Signed-off-by: Marc Zyngier
    Signed-off-by: Christoffer Dall

    Marc Zyngier
     
  • When saving the state of the list registers, it is critical to
    reset them zero, as we could otherwise leave unexpected EOI
    interrupts pending for virtual level interrupts.

    Cc: stable@vger.kernel.org # v4.6+
    Signed-off-by: Christoffer Dall
    Signed-off-by: Marc Zyngier

    Christoffer Dall
     

25 May, 2016

1 commit

  • This patch adds a kvm debugfs subdirectory for each VM, which is named
    after its pid and file descriptor. The directories contain the same
    kind of files that are already in the kvm debugfs directory, but the
    data exported through them is now VM specific.

    This makes the debugfs kvm data a convenient alternative to the
    tracepoints which already have per VM data. The debugfs data is easy
    to read and low overhead.

    CC: Dan Carpenter [includes fixes by Dan Carpenter]
    Signed-off-by: Janosch Frank
    Signed-off-by: Paolo Bonzini

    Janosch Frank
     

24 May, 2016

1 commit

  • …it/kvmarm/kvmarm into kvm-next

    KVM/ARM Changes for v4.7 take 2

    "The GIC is dead; Long live the GIC"

    This set of changes include the new vgic, which is a reimplementation of
    our horribly broken legacy vgic implementation. The two implementations
    will live side-by-side (with the new being the configured default) for
    one kernel release and then we'll remove it.

    Also fixes a non-critical issue with virtual abort injection to guests.

    Paolo Bonzini
     

20 May, 2016

27 commits

  • When modifying the active state of an interrupt via the MMIO interface,
    we should ensure that the write has the intended effect.

    If a guest sets an interrupt to active, but that interrupt is already
    flushed into a list register on a running VCPU, then that VCPU will
    write the active state back into the struct vgic_irq upon returning from
    the guest and syncing its state. This is a non-benign race, because the
    guest can observe that an interrupt is not active, and it can have a
    reasonable expectations that other VCPUs will not ack any IRQs, and then
    set the state to active, and expect it to stay that way. Currently we
    are not honoring this case.

    Thefore, change both the SACTIVE and CACTIVE mmio handlers to stop the
    world, change the irq state, potentially queue the irq if we're setting
    it to active, and then continue.

    We take this chance to slightly optimize these functions by not stopping
    the world when touching private interrupts where there is inherently no
    possible race.

    Signed-off-by: Christoffer Dall

    Christoffer Dall
     
  • Now that the new VGIC implementation has reached feature parity with
    the old one, add the new files to the build system and add a Kconfig
    option to switch between the two versions.
    We set the default to the new version to get maximum test coverage,
    in case people experience problems they can switch back to the old
    behaviour if needed.

    Signed-off-by: Andre Przywara
    Acked-by: Christoffer Dall

    Andre Przywara
     
  • We now store the mapped hardware IRQ number in our struct, so we
    don't need the irq_phys_map for the new VGIC.
    Implement the hardware IRQ mapping on top of the reworked arch
    timer interface.

    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • Connect to the new VGIC to the irqfd framework, so that we can
    inject IRQs.
    GSI routing and MSI routing is not yet implemented.

    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • Enable the VGIC operation by properly initialising the registers
    in the hypervisor GIC interface.

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • map_resources is the last initialization step. It is executed on
    first VCPU run. At that stage the code checks that userspace has provided
    the base addresses for the relevant VGIC regions, which depend on the
    type of VGIC that is exposed to the guest. Also we check if the two
    regions overlap.
    If the checks succeeded, we register the respective register frames with
    the kvm_io_bus framework.

    If we emulate a GICv2, the function also forces vgic_init execution if
    it has not been executed yet. Also we map the virtual GIC CPU interface
    onto the guest's CPU interface.

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • This patch allocates and initializes the data structures used
    to model the vgic distributor and virtual cpu interfaces. At that
    stage the number of IRQs and number of virtual CPUs is frozen.

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • This patch implements the vgic_creation function which is
    called on CREATE_IRQCHIP VM IOCTL (v2 only) or KVM_CREATE_DEVICE

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • Implements kvm_vgic_hyp_init and vgic_probe function.
    This uses the new firmware independent VGIC probing to support both ACPI
    and DT based systems (code from Marc Zyngier).

    The vgic_global struct is enriched with new fields populated
    by those functions.

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • Using the VMCR accessors we provide access to GIC CPU interface state
    to userland by wiring it up to the existing userland interface.
    [Marc: move and make VMCR accessors static, streamline MMIO handlers]

    Signed-off-by: Andre Przywara
    Signed-off-by: Marc Zyngier
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • Since the GIC CPU interface is always virtualized by the hardware,
    we don't have CPU interface state information readily available in our
    emulation if userland wants to save or restore it.
    Fortunately the GIC hypervisor interface provides the VMCR register to
    access the required virtual CPU interface bits.
    Provide wrappers for GICv2 and GICv3 hosts to have access to this
    register.

    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • Userland may want to save and restore the state of the in-kernel VGIC,
    so we provide the code which takes a userland request and translate
    that into calls to our MMIO framework.

    From Christoffer:
    When accessing the VGIC state from userspace we really don't want a VCPU
    to be messing with the state at the same time, and the API specifies
    that we should return -EBUSY if any VCPUs are running.
    Check and prevent VCPUs from running by grabbing their mutexes, one by
    one, and error out if we fail.
    (Note: This could potentially be simplified to just do a simple check
    and see if any VCPUs are running, and return -EBUSY then, without
    enforcing the locking throughout the duration of the uaccess, if we
    think that taking/releasing all these mutexes for every single GIC
    register access is too heavyweight.)

    Signed-off-by: Andre Przywara
    Signed-off-by: Christoffer Dall

    Andre Przywara
     
  • Userland can access the emulated GIC to save and restore its state
    for initialization or migration purposes.
    The kvm_io_bus API requires an absolute gpa, which does not fit the
    KVM_DEV_ARM_VGIC_GRP_DIST_REGS user API, that only provides relative
    offsets. So we provide a wrapper to plug into our MMIO framework and
    find the respective register handler.

    Signed-off-by: Christoffer Dall
    Signed-off-by: Andre Przywara

    Christoffer Dall
     
  • This patch implements the switches for KVM_DEV_ARM_VGIC_GRP_DIST_REGS
    and KVM_DEV_ARM_VGIC_GRP_CPU_REGS API which allows the userspace to
    access VGIC registers.

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • This patch implements the KVM_DEV_ARM_VGIC_GRP_ADDR group which
    enables to set the base address of GIC regions as seen by the guest.

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • kvm_vgic_addr is used by the userspace to set the base address of
    the following register regions, as seen by the guest:
    - distributor(v2 and v3),
    - re-distributors (v3),
    - CPU interface (v2).

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • This patch implements the KVM_DEV_ARM_VGIC_GRP_CTRL group API
    featuring KVM_DEV_ARM_VGIC_CTRL_INIT attribute. The vgic_init
    function is not yet implemented though.

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • This patch implements the KVM_DEV_ARM_VGIC_GRP_NR_IRQS group. This
    modality is supported by both VGIC V2 and V3 KVM device as will be
    other groups, hence the introduction of common helpers.

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • This patch introduces the skeleton for the KVM device operations
    associated to KVM_DEV_TYPE_ARM_VGIC_V2 and KVM_DEV_TYPE_ARM_VGIC_V3.

    At that stage kvm_vgic_create is stubbed.

    Signed-off-by: Eric Auger
    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Eric Auger
     
  • In contrast to GICv2 SGIs in a GICv3 implementation are not triggered
    by a MMIO write, but with a system register write. KVM knows about
    that register already, we just need to implement the handler and wire
    it up to the core KVM/ARM code.

    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • Since GICv3 supports much more than the 8 CPUs the GICv2 ITARGETSR
    register can handle, the new IROUTER register covers the whole range
    of possible target (V)CPUs by using the same MPIDR that the cores
    report themselves.
    In addition to translating this MPIDR into a vcpu pointer we store
    the originally written value as well. The architecture allows to
    write any values into the register, which must be read back as written.

    Since we don't support affinity level 3, we don't need to take care
    about the upper word of this 64-bit register, which simplifies the
    handling a bit.

    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • We implement the only one ID register that is required by the
    architecture, also this is the one that Linux actually checks.

    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • The redistributor TYPER tells the OS about the associated MPIDR,
    also the LAST bit is crucial to determine the number of redistributors.

    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • As in the GICv2 emulation we handle those three registers in one
    function.

    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • Create a new file called vgic-mmio-v3.c and describe the GICv3
    distributor and redistributor registers there.
    This adds a special macro to deal with the split of SGI/PPI in the
    redistributor and SPIs in the distributor, which allows us to reuse
    the existing GICv2 handlers for those registers which are compatible.
    Also we provide a function to deal with the registration of the two
    separate redistributor frames per VCPU.

    Signed-off-by: Andre Przywara
    Reviewed-by: Eric Auger
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • As this register is v2 specific, its implementation lives entirely
    in vgic-mmio-v2.c.
    This register allows setting the source mask of an IPI.

    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Andre Przywara
     
  • Triggering an IPI via this register is v2 specific, so the
    implementation lives entirely in vgic-mmio-v2.c.

    Signed-off-by: Andre Przywara
    Reviewed-by: Christoffer Dall

    Andre Przywara