26 Dec, 2011
1 commit
-
Unlike all of the other cpuid bits, the TSC deadline timer bit is set
unconditionally, regardless of what userspace wants.

This is broken in several ways:
- if userspace doesn't use KVM_CREATE_IRQCHIP, and doesn't emulate the TSC
deadline timer feature, a guest that uses the feature will break
- live migration to older host kernels that don't support the TSC deadline
timer will cause the feature to be pulled from under the guest's feet,
breaking it
- guests that are broken wrt the feature will fail.

Fix by not enabling the feature automatically; instead report it to userspace.
Because the feature depends on KVM_CREATE_IRQCHIP, which we cannot guarantee
will be called, we expose it via a KVM_CAP_TSC_DEADLINE_TIMER and not
KVM_GET_SUPPORTED_CPUID.

Fixes the Illumos guest kernel, which uses the TSC deadline timer feature.
[avi: add the KVM_CAP + documentation]
Reported-by: Alexey Zaytsev
Tested-by: Alexey Zaytsev
Signed-off-by: Jan Kiszka
Signed-off-by: Avi Kivity
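
The policy described in this commit can be sketched in plain C: the kernel only reports the capability, and userspace decides whether the guest sees the CPUID bit. This is a minimal illustration, not the actual KVM or userspace code. `build_guest_cpuid_ecx()` and its parameters are hypothetical names; `host_cap` is assumed to hold the result of probing KVM_CAP_TSC_DEADLINE_TIMER. The bit position (CPUID leaf 1, ECX bit 24) is the architectural TSC deadline flag.

```c
#include <stdbool.h>
#include <stdint.h>

/* CPUID.01H:ECX bit 24 advertises the TSC deadline timer. */
#define CPUID_ECX_TSC_DEADLINE (1u << 24)

/*
 * Hypothetical userspace helper: decide the guest-visible ECX value.
 * host_cap   - assumed result of probing KVM_CAP_TSC_DEADLINE_TIMER
 * user_wants - whether the VM configuration asks for the feature
 */
uint32_t build_guest_cpuid_ecx(uint32_t ecx, bool host_cap, bool user_wants)
{
    /* Never inherit the bit implicitly; start from a cleared state. */
    ecx &= ~CPUID_ECX_TSC_DEADLINE;
    if (host_cap && user_wants)
        ecx |= CPUID_ECX_TSC_DEADLINE;
    return ecx;
}
```

With this shape, a migration target that lacks the capability can be detected before the guest is ever told the feature exists.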
25 Dec, 2011
1 commit
-
User space may create the PIT and forget about setting up the irqchips.
In that case, firing PIT IRQs will crash the host:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000128
IP: [] kvm_set_irq+0x30/0x170 [kvm]
...
Call Trace:
[] pit_do_work+0x51/0xd0 [kvm]
[] process_one_work+0x111/0x4d0
[] worker_thread+0x152/0x340
[] kthread+0x7e/0x90
[] kernel_thread_helper+0x4/0x10

Prevent this by checking the irqchip mode before starting a timer. We
can't deny creating the PIT if the irqchips aren't set up yet as
current user land expects this order to work.

Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
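
The fix above amounts to a guard at timer start rather than at PIT creation. A minimal sketch, assuming simplified stand-in types (`struct vm_sketch`, `pit_create()` and `pit_start_timer()` are hypothetical names, not the KVM functions):

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the real KVM structures (hypothetical). */
struct vm_sketch {
    void *irqchip;      /* NULL until userspace sets up the irqchips */
    bool  timer_armed;
};

static bool irqchip_ready(const struct vm_sketch *vm)
{
    return vm->irqchip != NULL;
}

/* PIT creation always succeeds, since existing userspace creates it first. */
void pit_create(struct vm_sketch *vm)
{
    vm->timer_armed = false;
}

/* Arm the timer only if an IRQ could actually be delivered later. */
bool pit_start_timer(struct vm_sketch *vm)
{
    if (!irqchip_ready(vm))
        return false;   /* no irqchip: do not start, nothing can fire */
    vm->timer_armed = true;
    return true;
}
```

Checking at start time keeps the "create PIT, then irqchips" ordering working while ensuring the work item never runs against a missing irqchip.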
17 Nov, 2011
3 commits
-
Signed-off-by: Gleb Natapov
Signed-off-by: Avi Kivity
-
Support guest/host-only profiling by switching perf MSRs on
guest entry if needed.

Signed-off-by: Gleb Natapov
Signed-off-by: Avi Kivity
-
Some cpus have special support for switching PERF_GLOBAL_CTRL msr.
Add logic to detect if such support exists and works properly and extend
msr switching code to use it if available. Also extend number of generic
msr switching entries to 8.

Signed-off-by: Gleb Natapov
Signed-off-by: Avi Kivity
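
A rough sketch of the switching logic described above, assuming a simplified data structure: PERF_GLOBAL_CTRL takes a dedicated path when the CPU supports it, and everything else goes into a generic list of 8 entries. `struct msr_switch_sketch`, `add_switch_msr()` and the `has_dedicated_perf_ctrl` flag are hypothetical; only the MSR index 0x38f and the limit of 8 come from the architecture and the commit text.

```c
#include <stdbool.h>
#include <stdint.h>

#define NR_SWITCH_MSRS 8                /* generic slots, as in the commit text */
#define MSR_CORE_PERF_GLOBAL_CTRL 0x38f /* architectural MSR index */

struct msr_switch_sketch {
    bool     has_dedicated_perf_ctrl;   /* assumed result of a capability probe */
    bool     perf_ctrl_used;
    uint64_t perf_guest, perf_host;
    unsigned nr;                        /* generic slots currently in use */
    struct { uint32_t index; uint64_t guest, host; } slot[NR_SWITCH_MSRS];
};

/* Register an MSR to be switched; returns false when the generic list is full. */
bool add_switch_msr(struct msr_switch_sketch *m, uint32_t msr,
                    uint64_t guest, uint64_t host)
{
    unsigned i;

    /* Prefer the dedicated mechanism when it exists. */
    if (msr == MSR_CORE_PERF_GLOBAL_CTRL && m->has_dedicated_perf_ctrl) {
        m->perf_ctrl_used = true;
        m->perf_guest = guest;
        m->perf_host = host;
        return true;
    }

    /* Reuse an existing slot for this MSR, or take the next free one. */
    for (i = 0; i < m->nr; i++)
        if (m->slot[i].index == msr)
            break;
    if (i == NR_SWITCH_MSRS)
        return false;
    if (i == m->nr)
        m->nr++;
    m->slot[i].index = msr;
    m->slot[i].guest = guest;
    m->slot[i].host = host;
    return true;
}
```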
31 Oct, 2011
1 commit
-
* 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (33 commits)
iommu/core: Remove global iommu_ops and register_iommu
iommu/msm: Use bus_set_iommu instead of register_iommu
iommu/omap: Use bus_set_iommu instead of register_iommu
iommu/vt-d: Use bus_set_iommu instead of register_iommu
iommu/amd: Use bus_set_iommu instead of register_iommu
iommu/core: Use bus->iommu_ops in the iommu-api
iommu/core: Convert iommu_found to iommu_present
iommu/core: Add bus_type parameter to iommu_domain_alloc
Driver core: Add iommu_ops to bus_type
iommu/core: Define iommu_ops and register_iommu only with CONFIG_IOMMU_API
iommu/amd: Fix wrong shift direction
iommu/omap: always provide iommu debug code
iommu/core: let drivers know if an iommu fault handler isn't installed
iommu/core: export iommu_set_fault_handler()
iommu/omap: Fix build error with !IOMMU_SUPPORT
iommu/omap: Migrate to the generic fault report mechanism
iommu/core: Add fault reporting mechanism
iommu/core: Use PAGE_SIZE instead of hard-coded value
iommu/core: use the existing IS_ALIGNED macro
iommu/msm: ->unmap() should return order of unmapped page
...

Fixup trivial conflicts in drivers/iommu/Makefile: "move omap iommu to
dedicated iommu folder" vs "Rename the DMAR and INTR_REMAP config
options" just happened to touch lines next to each other.
30 Oct, 2011
1 commit
-
AMD processors apparently have a bug in the hardware task switching
support when NPT is enabled. If the task switch triggers a NPF, we can
get wrong EXITINTINFO along with that fault. On resume, spurious
exceptions may then be injected into the guest.

We were able to reproduce this bug when our guest triggered #SS and the
handler was supposed to run over a separate task with not yet touched
stack pages.

Work around the issue by continuing to emulate task switches even in
NPT mode.

Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
21 Oct, 2011
2 commits
-
…mu/fault-reporting' and 'api/iommu-ops-per-bus' into next
Conflicts:
drivers/iommu/amd_iommu.c
drivers/iommu/iommu.c
-
With per-bus iommu_ops the iommu_found function needs to
work on a bus_type too. This patch adds a bus_type parameter
to that function and converts all call-places.
The function is also renamed to iommu_present because the
function now checks if an iommu is present for a given bus
and does not check for a global iommu anymore.

Signed-off-by: Joerg Roedel
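
The per-bus check can be illustrated with stand-in types. This is a sketch only: the structs below are reduced to the one field that matters here, and the `-1` error return in `bus_set_iommu()` is an assumption rather than the kernel's actual error code.

```c
#include <stdbool.h>
#include <stddef.h>

/* Minimal stand-ins for the kernel types (heavily simplified). */
struct iommu_ops { int unused; };
struct bus_type {
    const char *name;
    const struct iommu_ops *iommu_ops;  /* NULL when no IOMMU driver is bound */
};

/* Per-bus registration; refuses a second registration (assumed policy). */
int bus_set_iommu(struct bus_type *bus, const struct iommu_ops *ops)
{
    if (bus->iommu_ops != NULL)
        return -1;
    bus->iommu_ops = ops;
    return 0;
}

/* Presence is now a property of the given bus, not of the whole system. */
bool iommu_present(const struct bus_type *bus)
{
    return bus->iommu_ops != NULL;
}
```

A caller that used to ask "is there any IOMMU?" now has to name the bus it cares about, so one bus having an IOMMU says nothing about another.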
05 Oct, 2011
1 commit
-
This patch emulates the LAPIC TSC deadline timer for the guest:
Enumerate tsc deadline timer capability by CPUID;
Enable tsc deadline timer mode by lapic MMIO;
Start tsc deadline timer by WRMSR;

[jan: use do_div()]
[avi: fix for !irqchip_in_kernel()]
[marcelo: another fix for !irqchip_in_kernel()]

Signed-off-by: Liu, Jinsong
Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
Signed-off-by: Avi Kivity
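
Starting the timer on WRMSR requires turning a TSC deadline into a host timer delay. A minimal sketch of that conversion, assuming the guest TSC frequency is known in kHz; `tsc_deadline_to_ns()` is a hypothetical name, and plain `/` stands in for the kernel's do_div(). Very large deltas could overflow the multiplication; this sketch does not guard against that.

```c
#include <stdint.h>

/*
 * Convert a TSC deadline into a nanosecond delay.
 * tsc_khz is the guest TSC frequency in kHz. A zero result means the
 * deadline has already passed (or the frequency is unknown), so the
 * caller should fire the timer immediately or not arm it at all.
 */
uint64_t tsc_deadline_to_ns(uint64_t now_tsc, uint64_t deadline_tsc,
                            uint32_t tsc_khz)
{
    if (tsc_khz == 0 || deadline_tsc <= now_tsc)
        return 0;
    /* cycles / (kHz * 1000) seconds == cycles * 1e6 / kHz nanoseconds */
    return (deadline_tsc - now_tsc) * 1000000ULL / tsc_khz;
}
```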
26 Sep, 2011
30 commits
-
If simultaneous NMIs happen, we're supposed to queue the second
and next (collapsing them), but currently we sometimes collapse
the second into the first.

Fix by using a counter for pending NMIs instead of a bool; since
the counter limit depends on whether the processor is currently
in an NMI handler, which can only be checked in vcpu context
(via the NMI mask), we add a new KVM_REQ_NMI to request recalculation
of the counter.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
The opcodes
push %seg
pop %seg
l%seg, %mem, %reg (e.g. lds/les/lss/lfs/lgs)

all have a segment register encoded in the instruction. To allow reuse,
decode the segment number into src2 during the decode stage instead of the
execution stage.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Use the same technique as the other OpMem variants, and goto mem_common.
Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
OpReg decoding has a hack that inhibits byte registers for movsx and movzx
instructions. It should be replaced by something better, but meanwhile,
qualify that the hack is only active for the destination operand.

Note these instructions only use OpReg for the destination, but better to
be explicit about it.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Similar to SrcImmUByte.
Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Op fields are going to grow by a bit; we need two free bits.
Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Unifying the operands means not taking advantage of the fact that some
operand types can only go into certain operands (for example, DI can only
be used by the destination), so we need more bits to hold the operand type.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Instead of decoding each operand using its own code, use a generic
function. Start with the destination operand.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Simplifies further generalization of decode.
Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Certain guests, specifically RTOSes, request faster periodic timers than
what we allow by default. Add a module parameter to adjust the limit for
non-standard setups. Also add a rate-limited warning in case the guest
requested more.

Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
-
The use of printk_ratelimit is discouraged; replace it with
pr*_ratelimited or __ratelimit. While at it, convert remaining
guest-triggerable printks to rate-limited variants.

Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
-
Convert remaining printks that the guest can trigger to apic_printk.
Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
-
This prevents events that cause the vmexit from being recorded before the
actual exit reason.

Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
-
The TEST instruction doesn't write its destination operand. This
could cause problems if an MMIO register was accessed using the TEST
instruction. Recently Windows XP was observed to use TEST against
the APIC ICR; this can cause spurious IPIs.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
emulate_1op_rax_rdx() is always called with the same parameters. Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
We have two emulate-with-extended-accumulator implementations: one
which expects traps (_ex) and one which doesn't (plain). Drop the
plain implementation and always use the one which expects traps;
it will simply return 0 in the _ex argument and we can happily ignore
it.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
emulate_1op() is always called with the same parameters. Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
emulate_2op_cl() is always called with the same parameters. Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
emulate_2op_cl() is always called with the same parameters. Simplify
by passing just the emulation context.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
emulate_2op_SrcV(), and its siblings, emulate_2op_SrcV_nobyte()
and emulate_2op_SrcB(), all use the same calling conventions
and all get passed exactly the same parameters. Simplify them
by passing just the emulation context.

Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
Instruction emulation for EOI writes can be skipped, since a sane
guest simply uses MOV instead of string operations. This is a nice
improvement when the guest doesn't support x2apic or Hyper-V EOI
support.

A single VM's bandwidth is observed to improve by ~8%
(7.4Gbps->8Gbps), by saving ~5% cycles from EOI emulation.

Signed-off-by: Kevin Tian
Signed-off-by: Eddie Dong
Signed-off-by: Marcelo Tosatti
Signed-off-by: Avi Kivity
-
When the TSC MSR is read by an L2 guest (when L1 allowed this MSR to be
read without exit), we need to return L2's notion of the TSC, not L1's.

The current code incorrectly returned L1 TSC, because svm_get_msr() was also
used in x86.c where this was assumed, but now that these places call the new
svm_read_l1_tsc(), the MSR read can be fixed.

Signed-off-by: Nadav Har'El
Tested-by: Joerg Roedel
Acked-by: Joerg Roedel
Signed-off-by: Avi Kivity
-
This patch fixes two corner cases in nested (L2) handling of TSC-related
issues:

1. Somewhat surprisingly, according to the Intel spec, if L1 allows WRMSR to
the TSC MSR without an exit, then this should set L1's TSC value itself - not
offset by vmcs12.TSC_OFFSET (like was wrongly done in the previous code).

2. Allow L1 to disable the TSC_OFFSETING control, and then correctly ignore
the vmcs12.TSC_OFFSET.

Signed-off-by: Nadav Har'El
Signed-off-by: Avi Kivity
-
KVM assumed in several places that reading the TSC MSR returns the value for
L1. This is incorrect, because when L2 is running, the correct TSC read exit
emulation is to return L2's value.

We therefore add a new x86_ops function, read_l1_tsc, to use in places that
specifically need to read the L1 TSC, NOT the TSC of the current level of
guest.

Note that one change, of one line in kvm_arch_vcpu_load, is made redundant
by a different patch sent by Zachary Amsden (and not yet applied):
kvm_arch_vcpu_load() should not read the guest TSC, and if it didn't, of
course we wouldn't have to change the call of kvm_get_msr() to read_l1_tsc().

[avi: moved callback to kvm_x86_ops tsc block]
Signed-off-by: Nadav Har'El
Acked-by: Zachary Amsden
Signed-off-by: Avi Kivity
-
This patch fixes a kvm-unit-tests hang and the PT_ACCESSED_MASK bit
being set incorrectly in the case of an SMEP fault. The code updated 'eperm' after
the variable was checked.

Signed-off-by: Yang, Wei
Signed-off-by: Avi Kivity
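
The ordering problem can be shown with a reduced walk step. This is a sketch under simplifying assumptions, not the real page-table walker: `walk_step()` and its parameters are hypothetical, and only the PTE bit positions (user = bit 2, accessed = bit 5) are architectural. The point is that every contribution to `eperm`, including the SMEP check, has to happen before `eperm` is tested and before the accessed bit is set.

```c
#include <stdbool.h>
#include <stdint.h>

#define PT_USER_MASK     (1ull << 2)
#define PT_ACCESSED_MASK (1ull << 5)

/* One simplified walk step; returns false on a permission fault. */
bool walk_step(uint64_t *pte, bool fetch, bool smep, bool kernel_mode)
{
    bool eperm = false;

    /* SMEP: a supervisor-mode fetch from a user page is forbidden. */
    if (fetch && smep && kernel_mode && (*pte & PT_USER_MASK))
        eperm = true;

    /* All permission checks must be complete before this test. */
    if (eperm)
        return false;   /* fault: leave the accessed bit untouched */

    *pte |= PT_ACCESSED_MASK;
    return true;
}
```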