15 Jul, 2016
4 commits
-
Once anon_inode_getfd() has succeeded, it's impossible to undo
in a clean way and no, sys_close() is not usable in such cases.
Use anon_inode_getfile() and get_unused_fd_flags() to get struct file
and descriptor and do *not* install the file into the descriptor table
until after the last possible failure exit.Signed-off-by: Paolo Bonzini
-
This reverts commit 77ecc085fed1af1000ca719522977b960aa6da52.
Al Viro colorfully says: "You should *NEVER* use sys_close() on failure
exit paths like that. Moreover, this kvm_put_kvm() becomes a double-put,
since closing the damn file will drop that reference to kvm. Please,
revert. anon_inode_getfd() should be used only when there's no possible
failures past its call".Signed-off-by: Paolo Bonzini
-
The failure of create debugfs of VM will return directly without release
the anon file. It will leak memory and file descriptors, even through
be not serious.Signed-off-by: Liu Shuo
Fixes: 536a6f88c49dd739961ffd53774775afed852c83
Signed-off-by: Paolo Bonzini -
When freeing the nested resources of a vcpu, there is an assumption that
the vcpu's vmcs01 is the current VMCS on the CPU that executes
nested_release_vmcs12(). If this assumption is violated, the vcpu's
vmcs01 may be made active on multiple CPUs at the same time, in
violation of Intel's specification. Moreover, since the vcpu's vmcs01 is
not VMCLEARed on every CPU on which it is active, it can linger in a
CPU's VMCS cache after it has been freed and potentially
repurposed. Subsequent eviction from the CPU's VMCS cache on a capacity
miss can result in memory corruption.It is not sufficient for vmx_free_vcpu() to call vmx_load_vmcs01(). If
the vcpu in question was last loaded on a different CPU, it must be
migrated to the current CPU before calling vmx_load_vmcs01().Signed-off-by: Jim Mattson
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
16 Jun, 2016
1 commit
-
These days, we experienced one guest crash with 8 cores and 3 disks,
with qemu error logs as bellow:qemu-system-x86_64: /build/qemu-2.0.0/kvm-all.c:984:
kvm_irqchip_commit_routes: Assertion `ret == 0' failed.And then we found one patch(bdf026317d) in qemu tree, which said
could fix this bug.Execute the following script will reproduce the BUG quickly:
irq_affinity.sh
========================================================================vda_irq_num=25
vdb_irq_num=27
while [ 1 ]
do
for irq in {1,2,4,8,10,20,40,80}
do
echo $irq > /proc/irq/$vda_irq_num/smp_affinity
echo $irq > /proc/irq/$vdb_irq_num/smp_affinity
dd if=/dev/vda of=/dev/zero bs=4K count=100 iflag=direct
dd if=/dev/vdb of=/dev/zero bs=4K count=100 iflag=direct
done
done
========================================================================The following qemu log is added in the qemu code and is displayed when
this bug reproduced:kvm_irqchip_commit_routes: max gsi: 1008, nr_allocated_irq_routes: 1024,
irq_routes->nr: 1024, gsi_count: 1024.That's to say when irq_routes->nr == 1024, there are 1024 routing entries,
but in the kernel code when routes->nr >= 1024, will just return -EINVAL;The nr is the number of the routing entries which is in of
[1 ~ KVM_MAX_IRQ_ROUTES], not the index in [0 ~ KVM_MAX_IRQ_ROUTES - 1].This patch fix the BUG above.
Cc: stable@vger.kernel.org
Signed-off-by: Xiubo Li
Signed-off-by: Wei Tang
Signed-off-by: Zhang Zhuoyu
Signed-off-by: Paolo Bonzini
02 Jun, 2016
3 commits
-
This causes an ugly dmesg splat. Beautified syzkaller testcase:
#include
#include
#include
#include
#includelong r[8];
int main()
{
struct kvm_irq_routing ir = { 0 };
r[2] = open("/dev/kvm", O_RDWR);
r[3] = ioctl(r[2], KVM_CREATE_VM, 0);
r[4] = ioctl(r[3], KVM_SET_GSI_ROUTING, &ir);
return 0;
}Reported-by: Dmitry Vyukov
Signed-off-by: Paolo Bonzini
Signed-off-by: Radim Krčmář -
Found by syzkaller:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000120
IP: [] kvm_irq_map_gsi+0x12/0x90 [kvm]
PGD 6f80b067 PUD b6535067 PMD 0
Oops: 0000 [#1] SMP
CPU: 3 PID: 4988 Comm: a.out Not tainted 4.4.9-300.fc23.x86_64 #1
[...]
Call Trace:
[] irqfd_update+0x32/0xc0 [kvm]
[] kvm_irqfd+0x3dc/0x5b0 [kvm]
[] kvm_vm_ioctl+0x164/0x6f0 [kvm]
[] do_vfs_ioctl+0x298/0x480
[] SyS_ioctl+0x79/0x90
[] tracesys_phase2+0x84/0x89
Code: b5 71 a7 e0 5b 41 5c 41 5d 5d f3 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 8b 8f 10 2e 00 00 31 c0 48 89 e5 91 20 01 00 00 76 6a 48 63 d2 48 8b 94 d1 28 01 00 00 48 85
RIP [] kvm_irq_map_gsi+0x12/0x90 [kvm]
RSP
CR2: 0000000000000120Testcase:
#include
#include
#include
#include
#include
#include
#includelong r[26];
int main()
{
memset(r, -1, sizeof(r));
r[2] = open("/dev/kvm", 0);
r[3] = ioctl(r[2], KVM_CREATE_VM, 0);struct kvm_irqfd ifd;
ifd.fd = syscall(SYS_eventfd2, 5, 0);
ifd.gsi = 3;
ifd.flags = 2;
ifd.resamplefd = ifd.fd;
r[25] = ioctl(r[3], KVM_IRQFD, &ifd);
return 0;
}Reported-by: Dmitry Vyukov
Cc: stable@vger.kernel.org
Signed-off-by: Paolo Bonzini
Signed-off-by: Radim Krčmář -
When changing the active bit from an MMIO trap, we decide to
explode if the intid is that of a private interrupt.This flawed logic comes from the fact that we were assuming that
kvm_vcpu_kick() as called by kvm_arm_halt_vcpu() would not return before
the called vcpu responded, but this is not the case, so we need to
perform this wait even for private interrupts.Dropping the BUG_ON seems like the right thing to do.
[ Commit message tweaked by Christoffer ]
Signed-off-by: Marc Zyngier
Signed-off-by: Christoffer Dall
31 May, 2016
3 commits
-
When reading back from the list registers, we need to perform
two actions for level interrupts:
1) clear the soft-pending bit if the interrupt is not pending
anymore *in the list register*
2) resample the line level and propagate it to the pending stateBut these two actions shouldn't be linked, and we should *always*
resample the line level, no matter what state is in the list
register. Otherwise, we may end-up injecting spurious interrupts
that have been already retired.Signed-off-by: Marc Zyngier
Signed-off-by: Christoffer Dall -
When reading back from the list registers, we need to perform
two actions for level interrupts:
1) clear the soft-pending bit if the interrupt is not pending
anymore *in the list register*
2) resample the line level and propagate it to the pending stateBut these two actions shouldn't be linked, and we should *always*
resample the line level, no matter what state is in the list
register. Otherwise, we may end-up injecting spurious interrupts
that have been already retired.Signed-off-by: Marc Zyngier
Signed-off-by: Christoffer Dall -
When saving the state of the list registers, it is critical to
reset them zero, as we could otherwise leave unexpected EOI
interrupts pending for virtual level interrupts.Cc: stable@vger.kernel.org # v4.6+
Signed-off-by: Christoffer Dall
Signed-off-by: Marc Zyngier
25 May, 2016
1 commit
-
This patch adds a kvm debugfs subdirectory for each VM, which is named
after its pid and file descriptor. The directories contain the same
kind of files that are already in the kvm debugfs directory, but the
data exported through them is now VM specific.This makes the debugfs kvm data a convenient alternative to the
tracepoints which already have per VM data. The debugfs data is easy
to read and low overhead.CC: Dan Carpenter [includes fixes by Dan Carpenter]
Signed-off-by: Janosch Frank
Signed-off-by: Paolo Bonzini
24 May, 2016
1 commit
-
…it/kvmarm/kvmarm into kvm-next
KVM/ARM Changes for v4.7 take 2
"The GIC is dead; Long live the GIC"
This set of changes include the new vgic, which is a reimplementation of
our horribly broken legacy vgic implementation. The two implementations
will live side-by-side (with the new being the configured default) for
one kernel release and then we'll remove it.Also fixes a non-critical issue with virtual abort injection to guests.
20 May, 2016
27 commits
-
When modifying the active state of an interrupt via the MMIO interface,
we should ensure that the write has the intended effect.If a guest sets an interrupt to active, but that interrupt is already
flushed into a list register on a running VCPU, then that VCPU will
write the active state back into the struct vgic_irq upon returning from
the guest and syncing its state. This is a non-benign race, because the
guest can observe that an interrupt is not active, and it can have a
reasonable expectations that other VCPUs will not ack any IRQs, and then
set the state to active, and expect it to stay that way. Currently we
are not honoring this case.Thefore, change both the SACTIVE and CACTIVE mmio handlers to stop the
world, change the irq state, potentially queue the irq if we're setting
it to active, and then continue.We take this chance to slightly optimize these functions by not stopping
the world when touching private interrupts where there is inherently no
possible race.Signed-off-by: Christoffer Dall
-
Now that the new VGIC implementation has reached feature parity with
the old one, add the new files to the build system and add a Kconfig
option to switch between the two versions.
We set the default to the new version to get maximum test coverage,
in case people experience problems they can switch back to the old
behaviour if needed.Signed-off-by: Andre Przywara
Acked-by: Christoffer Dall -
We now store the mapped hardware IRQ number in our struct, so we
don't need the irq_phys_map for the new VGIC.
Implement the hardware IRQ mapping on top of the reworked arch
timer interface.Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
Connect to the new VGIC to the irqfd framework, so that we can
inject IRQs.
GSI routing and MSI routing is not yet implemented.Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
Enable the VGIC operation by properly initialising the registers
in the hypervisor GIC interface.Signed-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
map_resources is the last initialization step. It is executed on
first VCPU run. At that stage the code checks that userspace has provided
the base addresses for the relevant VGIC regions, which depend on the
type of VGIC that is exposed to the guest. Also we check if the two
regions overlap.
If the checks succeeded, we register the respective register frames with
the kvm_io_bus framework.If we emulate a GICv2, the function also forces vgic_init execution if
it has not been executed yet. Also we map the virtual GIC CPU interface
onto the guest's CPU interface.Signed-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
This patch allocates and initializes the data structures used
to model the vgic distributor and virtual cpu interfaces. At that
stage the number of IRQs and number of virtual CPUs is frozen.Signed-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
This patch implements the vgic_creation function which is
called on CREATE_IRQCHIP VM IOCTL (v2 only) or KVM_CREATE_DEVICESigned-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
Implements kvm_vgic_hyp_init and vgic_probe function.
This uses the new firmware independent VGIC probing to support both ACPI
and DT based systems (code from Marc Zyngier).The vgic_global struct is enriched with new fields populated
by those functions.Signed-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
Using the VMCR accessors we provide access to GIC CPU interface state
to userland by wiring it up to the existing userland interface.
[Marc: move and make VMCR accessors static, streamline MMIO handlers]Signed-off-by: Andre Przywara
Signed-off-by: Marc Zyngier
Reviewed-by: Christoffer Dall -
Since the GIC CPU interface is always virtualized by the hardware,
we don't have CPU interface state information readily available in our
emulation if userland wants to save or restore it.
Fortunately the GIC hypervisor interface provides the VMCR register to
access the required virtual CPU interface bits.
Provide wrappers for GICv2 and GICv3 hosts to have access to this
register.Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
Userland may want to save and restore the state of the in-kernel VGIC,
so we provide the code which takes a userland request and translate
that into calls to our MMIO framework.From Christoffer:
When accessing the VGIC state from userspace we really don't want a VCPU
to be messing with the state at the same time, and the API specifies
that we should return -EBUSY if any VCPUs are running.
Check and prevent VCPUs from running by grabbing their mutexes, one by
one, and error out if we fail.
(Note: This could potentially be simplified to just do a simple check
and see if any VCPUs are running, and return -EBUSY then, without
enforcing the locking throughout the duration of the uaccess, if we
think that taking/releasing all these mutexes for every single GIC
register access is too heavyweight.)Signed-off-by: Andre Przywara
Signed-off-by: Christoffer Dall -
Userland can access the emulated GIC to save and restore its state
for initialization or migration purposes.
The kvm_io_bus API requires an absolute gpa, which does not fit the
KVM_DEV_ARM_VGIC_GRP_DIST_REGS user API, that only provides relative
offsets. So we provide a wrapper to plug into our MMIO framework and
find the respective register handler.Signed-off-by: Christoffer Dall
Signed-off-by: Andre Przywara -
This patch implements the switches for KVM_DEV_ARM_VGIC_GRP_DIST_REGS
and KVM_DEV_ARM_VGIC_GRP_CPU_REGS API which allows the userspace to
access VGIC registers.Signed-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
This patch implements the KVM_DEV_ARM_VGIC_GRP_ADDR group which
enables to set the base address of GIC regions as seen by the guest.Signed-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
kvm_vgic_addr is used by the userspace to set the base address of
the following register regions, as seen by the guest:
- distributor(v2 and v3),
- re-distributors (v3),
- CPU interface (v2).Signed-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
This patch implements the KVM_DEV_ARM_VGIC_GRP_CTRL group API
featuring KVM_DEV_ARM_VGIC_CTRL_INIT attribute. The vgic_init
function is not yet implemented though.Signed-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
This patch implements the KVM_DEV_ARM_VGIC_GRP_NR_IRQS group. This
modality is supported by both VGIC V2 and V3 KVM device as will be
other groups, hence the introduction of common helpers.Signed-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
This patch introduces the skeleton for the KVM device operations
associated to KVM_DEV_TYPE_ARM_VGIC_V2 and KVM_DEV_TYPE_ARM_VGIC_V3.At that stage kvm_vgic_create is stubbed.
Signed-off-by: Eric Auger
Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
In contrast to GICv2 SGIs in a GICv3 implementation are not triggered
by a MMIO write, but with a system register write. KVM knows about
that register already, we just need to implement the handler and wire
it up to the core KVM/ARM code.Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
Since GICv3 supports much more than the 8 CPUs the GICv2 ITARGETSR
register can handle, the new IROUTER register covers the whole range
of possible target (V)CPUs by using the same MPIDR that the cores
report themselves.
In addition to translating this MPIDR into a vcpu pointer we store
the originally written value as well. The architecture allows to
write any values into the register, which must be read back as written.Since we don't support affinity level 3, we don't need to take care
about the upper word of this 64-bit register, which simplifies the
handling a bit.Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
We implement the only one ID register that is required by the
architecture, also this is the one that Linux actually checks.Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
The redistributor TYPER tells the OS about the associated MPIDR,
also the LAST bit is crucial to determine the number of redistributors.Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
As in the GICv2 emulation we handle those three registers in one
function.Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
Create a new file called vgic-mmio-v3.c and describe the GICv3
distributor and redistributor registers there.
This adds a special macro to deal with the split of SGI/PPI in the
redistributor and SPIs in the distributor, which allows us to reuse
the existing GICv2 handlers for those registers which are compatible.
Also we provide a function to deal with the registration of the two
separate redistributor frames per VCPU.Signed-off-by: Andre Przywara
Reviewed-by: Eric Auger
Reviewed-by: Christoffer Dall -
As this register is v2 specific, its implementation lives entirely
in vgic-mmio-v2.c.
This register allows setting the source mask of an IPI.Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall -
Triggering an IPI via this register is v2 specific, so the
implementation lives entirely in vgic-mmio-v2.c.Signed-off-by: Andre Przywara
Reviewed-by: Christoffer Dall