07 Apr, 2013
1 commit
-
This patch adds support in the kvm_gfn_to_hva_cache functions for
reads and writes that cross a page boundary. If the range falls within
a single memslot, this remains a fast operation. If the range
is split between two memslots, the slower kvm_read_guest and
kvm_write_guest are used.

Tested: tested against the kvm_clock unit tests.
Signed-off-by: Andrew Honig
Signed-off-by: Gleb Natapov
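The fast-path decision above boils down to checking whether a guest range crosses a page boundary. A minimal userspace sketch of that check (hypothetical helper name, 4 KiB pages assumed):

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define PAGE_MASK (~(PAGE_SIZE - 1))

/* Return true if [gpa, gpa + len) spans more than one page, in which
 * case the cached single-slot fast path cannot be used. */
bool range_crosses_page(uint64_t gpa, uint64_t len)
{
    if (len == 0)
        return false;
    return (gpa & PAGE_MASK) != ((gpa + len - 1) & PAGE_MASK);
}
```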
20 Mar, 2013
1 commit
-
If the guest writes an invalid value to IOAPIC_REG_SELECT and follows
that with a read of IOAPIC_REG_WINDOW, KVM does not properly validate
the request. ioapic_read_indirect contains an
ASSERT(redir_index < IOAPIC_NUM_PINS), but the ASSERT has no effect in
non-debug builds. In recent kernels this allows a guest to cause a kernel
oops by reading invalid memory. In older kernels (pre-3.3) it allows a
guest to read from large ranges of host memory.

Tested: tested against the apic unit tests.
Signed-off-by: Andrew Honig
Signed-off-by: Marcelo Tosatti
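The hardening described above amounts to replacing the compiled-away ASSERT with a real range check on the guest-supplied index. A standalone sketch (simplified structures and helper names, not the actual KVM code):

```c
#include <stdint.h>

#define IOAPIC_NUM_PINS 24

struct ioapic_redir { uint32_t low, high; };

static struct ioapic_redir redirtbl[IOAPIC_NUM_PINS];

/* Reject out-of-range indirect reads instead of relying on ASSERT(),
 * which has no effect in non-debug builds. */
int ioapic_read_redir(uint32_t redir_index, uint32_t *out)
{
    if (redir_index >= IOAPIC_NUM_PINS)
        return -1;              /* invalid guest-supplied index */
    *out = redirtbl[redir_index].low;
    return 0;
}
```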
28 Feb, 2013
1 commit
-
I'm not sure why, but the hlist for-each-entry iterators were conceived
differently from the list ones. The list iterator is:

list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
do they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.

Besides the semantic patch, some manual work was required:
- Fix up the actual hlist iterators in linux/list.h.
- Fix up the declaration of other iterators based on the hlist ones.
- A very small number of places were using the 'node' parameter; these
were modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
properly, so those had to be fixed up manually.

The semantic patch, which is mostly the work of Peter Senna Tschudin, is:
@@
iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
type T;
expression a,c,d,e;
identifier b;
statement S;
@@
-T b;
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foundation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin
Acked-by: Paul E. McKenney
Signed-off-by: Sasha Levin
Cc: Wu Fengguang
Cc: Marcelo Tosatti
Cc: Gleb Natapov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Feb, 2013
1 commit
-
This field was needed to differentiate memory slots created by the new
API, KVM_SET_USER_MEMORY_REGION, from those created by the old equivalent,
KVM_SET_MEMORY_REGION, whose support was dropped long before in:

commit b74a07beed0e64bfba413dcb70dd6749c57f43dc
KVM: Remove kernel-allocated memory regions

Although we also have private memory slots to which KVM allocates
memory with vm_mmap() (!user_alloc slots, in other words), the slot id
should be enough for differentiating them.

Note: corresponding function parameters will be removed later.
Reviewed-by: Marcelo Tosatti
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Gleb Natapov
05 Feb, 2013
2 commits
-
As Xiao pointed out, there are a few problems with it:
- kvm_arch_commit_memory_region() write protects the memory slot only
for GET_DIRTY_LOG when modifying the flags.
- FNAME(sync_page) uses the old spte value to set a new one without
checking the KVM_MEM_READONLY flag.

Since we flush all shadow pages when creating a new slot, the simplest
fix is to disallow such problematic flag changes: this is safe because
no one is doing such things.

Reviewed-by: Gleb Natapov
Signed-off-by: Takuya Yoshikawa
Cc: Xiao Guangrong
Cc: Alex Williamson
Signed-off-by: Marcelo Tosatti -
KVM_SET_USER_MEMORY_REGION forces __kvm_set_memory_region() to identify
what kind of change is being requested by checking the arguments. The
current code does this checking at various points in the code, and each
condition being used there is not easy to understand at first glance.

This patch consolidates these checks and introduces an enum to name the
possible changes to clean up the code.

Although this does not introduce any functional changes, there is one
change which optimizes the code a bit: if we have nothing to change, the
new code returns 0 immediately.

Note that the return value for this case cannot be changed, since QEMU
relies on it: we noticed this when we changed it to -EINVAL and got a
section mismatch error at the final stage of live migration.

Signed-off-by: Takuya Yoshikawa
Signed-off-by: Marcelo Tosatti
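A sketch of the consolidation: classify the requested transition once, up front, into a named enum. Names are modeled on the commit's kvm_mr_change values; MR_NONE and the struct layout are illustrative stand-ins for the "nothing to change, return 0" case:

```c
#include <stdint.h>

enum mr_change { MR_CREATE, MR_DELETE, MR_MOVE, MR_FLAGS_ONLY, MR_NONE };

struct slot { uint64_t base_gfn, npages; uint32_t flags; };

/* Decide once what kind of change is being requested, instead of
 * re-deriving it from the raw arguments at several points. */
enum mr_change classify_change(const struct slot *old_slot,
                               const struct slot *new_slot)
{
    if (!old_slot->npages && new_slot->npages)
        return MR_CREATE;
    if (old_slot->npages && !new_slot->npages)
        return MR_DELETE;
    if (new_slot->base_gfn != old_slot->base_gfn)
        return MR_MOVE;
    if (new_slot->flags != old_slot->flags)
        return MR_FLAGS_ONLY;
    return MR_NONE;     /* nothing to change: caller just returns 0 */
}
```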
29 Jan, 2013
2 commits
-
yield_to returns -ESRCH when the run queue length of both the source
and the target of yield_to is one. When we see three successive failures
of yield_to, we assume we are in a potential undercommit case and abort
from the PLE handler.
The assumption is backed by the low probability of a wrong decision
even for worst-case scenarios, such as an average runqueue length
between 1 and 2.

More detail on the rationale behind using three tries:
if p is the probability of finding a rq of length one on a particular cpu,
and we do n tries, then the probability of exiting the PLE handler is
p^(n+1) [because we would have come across one source with rq length
1 and n target cpu rqs with length 1].

So, for 1.5x overcommit:

num tries: probability of aborting ple handler
1          1/4
2          1/8
3          1/16

We can increase this probability with more tries, but the problem is
the overhead.
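The table can be reproduced from the formula; for 1.5x overcommit the text takes p = 1/2. A quick sketch of p^(n+1):

```c
/* Probability of aborting the PLE handler after n failed yield_to()
 * tries: p^(n+1), i.e. one source rq of length 1 plus n target rqs
 * of length 1, each hit with probability p. */
double abort_probability(double p, int n)
{
    double prob = 1.0;

    for (int i = 0; i <= n; i++)
        prob *= p;
    return prob;
}
```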
Also, if we have tried three times, that means we would have iterated
over 3 good eligible vcpus along with many non-eligible candidates. In
the worst case, if we iterate over all the vcpus, we reduce 1x
performance, and overcommit performance gets hit.

Note that we do not update the last boosted vcpu in failure cases.

Thanks to Avi for raising the question of aborting after the first
failed yield_to.

Reviewed-by: Srikar Dronamraju
Signed-off-by: Raghavendra K T
Tested-by: Chegu Vinod
Signed-off-by: Gleb Natapov -
Virtual interrupt delivery avoids the need for KVM to inject vAPIC
interrupts manually; this is fully taken care of by the hardware. This
needs some special awareness in the existing interrupt injection path:

- For a pending interrupt, instead of direct injection, we may need to
update architecture-specific indicators before resuming to the guest.
- A pending interrupt which is masked by the ISR should also be
considered in the above update action, since the hardware will decide
when to inject it at the right time. The current has_interrupt and
get_interrupt only return a valid vector from the injection point of view.

Reviewed-by: Marcelo Tosatti
Signed-off-by: Kevin Tian
Signed-off-by: Yang Zhang
Signed-off-by: Gleb Natapov
27 Jan, 2013
2 commits
-
We've been ignoring read-only mappings and programming everything
into the iommu as read-write. Fix this to only include the write
access flag when read-only is not set.

Signed-off-by: Alex Williamson
Signed-off-by: Gleb Natapov -
Memory slot flags can be altered without changing other parameters of
the slot. The read-only attribute is the only one the IOMMU cares
about, so generate an unmap/re-map when this occurs. This also
avoids unnecessarily re-mapping the slot when no IOMMU-visible changes
are made.

Reviewed-by: Xiao Guangrong
Signed-off-by: Alex Williamson
Signed-off-by: Gleb Natapov
17 Jan, 2013
3 commits
-
One such variable, slot, is enough for holding a pointer temporarily.
We also remove another local variable named slot, which is limited to
a block, since it is confusing to have the same name twice in this function.

Reviewed-by: Marcelo Tosatti
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Gleb Natapov -
We don't need the check when deleting an existing slot or just modifying
the flags.

Reviewed-by: Marcelo Tosatti
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Gleb Natapov -
This makes the separation between the sanity checks and the rest of the
code a bit clearer.

Reviewed-by: Marcelo Tosatti
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Gleb Natapov
14 Jan, 2013
1 commit
-
Calling kvm_mmu_slot_remove_write_access() for a deleted slot does
nothing but search for non-existent mmu pages which have mappings to
that deleted memory; this is safe but a waste of time.

Since we want to make the function rmap-based in a later patch, in a
manner which makes it unsafe to call for a deleted slot, we make
the caller check that the slot is non-zero and is being dirty-logged.

Reviewed-by: Marcelo Tosatti
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Gleb Natapov
24 Dec, 2012
1 commit
-
Move a repetitive code sequence to a separate function.
Reviewed-by: Alex Williamson
Signed-off-by: Gleb Natapov
23 Dec, 2012
2 commits
-
Previous patch "kvm: Minor memory slot optimization" (b7f69c555ca43)
overlooked the generation field of the memory slots. Re-using the
original memory slots left us with two slightly different memory
slots with the same generation. To fix this, make update_memslots()
take a new parameter to specify the last generation. This also makes
generation management more explicit to avoid such problems in the future.

Reported-by: Takuya Yoshikawa
Signed-off-by: Alex Williamson
Signed-off-by: Gleb Natapov -
This hack is wrong. The PIT is connected to pin 2, not pin 0, which
means the hack never takes effect, so it is safe to remove it.

Signed-off-by: Yang Zhang
Signed-off-by: Gleb Natapov
14 Dec, 2012
7 commits
-
We're currently offering a whopping 32 memory slots to user space, an
int is a bit excessive for storing this. We would like to increase
our memslots, but SHRT_MAX should be more than enough.

Reviewed-by: Gleb Natapov
Signed-off-by: Alex Williamson
Signed-off-by: Marcelo Tosatti -
There's no need for this to be an int, it holds a boolean.
Move to the end of the struct for alignment.

Reviewed-by: Gleb Natapov
Signed-off-by: Alex Williamson
Signed-off-by: Marcelo Tosatti -
It's easy to confuse KVM_MEMORY_SLOTS and KVM_MEM_SLOTS_NUM. One is
the user accessible slots and the other is user + private. Make this
more obvious.

Reviewed-by: Gleb Natapov
Signed-off-by: Alex Williamson
Signed-off-by: Marcelo Tosatti -
If a slot is removed or moved in the guest physical address space, we
first allocate and install a new slot array with the invalidated
entry. The old array is then freed. We then proceed to allocate yet
another slot array to install the permanent replacement. Re-use the
original array when this occurs and avoid the extra kfree/kmalloc.

Reviewed-by: Gleb Natapov
Signed-off-by: Alex Williamson
Signed-off-by: Marcelo Tosatti -
The iommu integration into memory slots expects memory slots to be
added or removed and doesn't handle the move case. We can unmap
slots from the iommu after we mark them invalid and map them before
installing the final memslot array. Also re-order the kmemdup vs
map so we don't leave iommu mappings if we get ENOMEM.

Reviewed-by: Gleb Natapov
Signed-off-by: Alex Williamson
Signed-off-by: Marcelo Tosatti -
The API documents that only flags and guest physical memory space can
be modified on an existing slot, but we don't enforce that the
userspace address cannot be modified. Instead we just ignore it.
This means that a user may think they've successfully moved both the
guest and user addresses, when in fact only the guest address changed.
Check and return an error instead.

Reviewed-by: Gleb Natapov
Signed-off-by: Alex Williamson
Signed-off-by: Marcelo Tosatti -
The API documentation states:
When changing an existing slot, it may be moved in the guest
physical memory space, or its flags may be modified.

An "existing slot" requires a non-zero npages (memory_size). The only
transition we should therefore allow for a non-existing slot is to
create the slot, which includes setting a non-zero memory_size. We
currently allow calls to modify non-existing slots, which is pointless,
confusing, and possibly wrong.

With this we know that the invalidation path of __kvm_set_memory_region
is always for a delete or move and never for adding a zero-size slot.

Reviewed-by: Gleb Natapov
Signed-off-by: Alex Williamson
Signed-off-by: Marcelo Tosatti
11 Dec, 2012
1 commit
-
A typo in the next pointer means we're walking random data here.
Signed-off-by: Alex Williamson
Signed-off-by: Marcelo Tosatti
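The bug class here is the classic list-walk typo: advancing the cursor by the wrong pointer so the walk never follows the chain. A hypothetical before/after sketch (not the actual KVM code):

```c
#include <stddef.h>

struct node { int val; struct node *next; };

/* Correct walk: the cursor itself is advanced each iteration. The
 * buggy version advanced with something like 'n = head->next',
 * revisiting the same element or wandering into unrelated memory. */
int list_sum(const struct node *head)
{
    int sum = 0;
    const struct node *n = head;

    while (n) {
        sum += n->val;
        n = n->next;
    }
    return sum;
}
```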
10 Dec, 2012
1 commit
-
* 'for-upstream' of https://github.com/agraf/linux-2.6: (28 commits)
KVM: PPC: booke: Get/set guest EPCR register using ONE_REG interface
KVM: PPC: bookehv: Add EPCR support in mtspr/mfspr emulation
KVM: PPC: bookehv: Add guest computation mode for irq delivery
KVM: PPC: Make EPCR a valid field for booke64 and bookehv
KVM: PPC: booke: Extend MAS2 EPN mask for 64-bit
KVM: PPC: e500: Mask MAS2 EPN high 32-bits in 32/64 tlbwe emulation
KVM: PPC: Mask ea's high 32-bits in 32/64 instr emulation
KVM: PPC: e500: Add emulation helper for getting instruction ea
KVM: PPC: bookehv64: Add support for interrupt handling
KVM: PPC: bookehv: Remove GET_VCPU macro from exception handler
KVM: PPC: booke: Fix get_tb() compile error on 64-bit
KVM: PPC: e500: Silence bogus GCC warning in tlb code
KVM: PPC: Book3S HV: Handle guest-caused machine checks on POWER7 without panicking
KVM: PPC: Book3S HV: Improve handling of local vs. global TLB invalidations
MAINTAINERS: Add git tree link for PPC KVM
KVM: PPC: Book3S PR: MSR_DE doesn't exist on Book 3S
KVM: PPC: Book3S PR: Fix VSX handling
KVM: PPC: Book3S PR: Emulate PURR, SPURR and DSCR registers
KVM: PPC: Book3S HV: Don't give the guest RW access to RO pages
KVM: PPC: Book3S HV: Report correct HPT entry index when reading HPT
...
06 Dec, 2012
1 commit
-
The current eventfd code assumes that when we have eventfd, we also have
irqfd for in-kernel interrupt delivery. This is not necessarily true. On
PPC we don't have an in-kernel irqchip yet, but we can still easily
support eventfd.

Signed-off-by: Alexander Graf
05 Dec, 2012
2 commits
-
We can deliver certain interrupts, notably MSI, from atomic context.
Use kvm_set_irq_inatomic to implement an irq handler for MSI.

This reduces the pressure on the scheduler in the case where host and
guest irqs share a host cpu.

Signed-off-by: Michael S. Tsirkin
Signed-off-by: Gleb Natapov -
Add an API to inject IRQ from atomic context.
Return EWOULDBLOCK if impossible (e.g. for multicast).
Only MSI is supported ATM.

Signed-off-by: Michael S. Tsirkin
Signed-off-by: Gleb Natapov
30 Nov, 2012
1 commit
-
Prior to memory slot sorting this loop compared all of the user memory
slots for overlap with new entries. With memory slot sorting, we're
just checking some number of entries in the array that may or may not
be user slots. Instead, walk all the slots with kvm_for_each_memslot,
which has the added benefit of terminating early when we hit the first
empty slot, and skip comparison to private slots.

Cc: stable@vger.kernel.org
Signed-off-by: Alex Williamson
Signed-off-by: Marcelo Tosatti
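A standalone sketch of the corrected check, assuming the post-sorting invariant that occupied slots come first so the walk may stop at the first empty entry (hypothetical helper; the real code uses kvm_for_each_memslot):

```c
#include <stdbool.h>
#include <stdint.h>

#define NSLOTS 32

struct slot { uint64_t base_gfn, npages; int id; };

/* Slots are sorted so that empty (npages == 0) entries come last,
 * which lets the walk terminate early. The slot being changed
 * (skip_id) is excluded from the overlap comparison. */
bool overlaps_existing(const struct slot *slots, int skip_id,
                       uint64_t base_gfn, uint64_t npages)
{
    for (int i = 0; i < NSLOTS && slots[i].npages; i++) {
        const struct slot *s = &slots[i];

        if (s->id == skip_id)
            continue;
        if (base_gfn < s->base_gfn + s->npages &&
            base_gfn + npages > s->base_gfn)
            return true;
    }
    return false;
}
```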
28 Nov, 2012
2 commits
-
TSC initialization will soon make use of online_vcpus.
Signed-off-by: Marcelo Tosatti
-
KVM added a global variable to guarantee monotonicity in the guest.
One of the reasons for that is that the time between

1. ktime_get_ts(&timespec);
2. rdtscll(tsc);

is variable. That is, given a host with a stable TSC, suppose that
two VCPUs read the same time via ktime_get_ts() above. The time
required to execute 2. is not the same on those two instances
executing on different VCPUs (cache misses, interrupts...).

If the TSC value that is used by the host to interpolate when
calculating the monotonic time is the same value used to calculate
the tsc_timestamp value stored in the pvclock data structure, and
a single (system_time, tsc_timestamp) tuple is visible to all
vcpus simultaneously, this problem disappears. See the comment on top
of pvclock_update_vm_gtod_copy for details.

Monotonicity is then guaranteed by the synchronicity of the host TSCs
and guest TSCs.

Set the TSC stable pvclock flag in that case, allowing the guest to
read the clock from userspace.

Signed-off-by: Marcelo Tosatti
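A simplified guest-side view of why a single shared tuple gives monotonic readings: every vcpu interpolates from the same base pair. Field names mirror the pvclock ABI, but the real scaling uses a mul/shift pair rather than the whole-nanoseconds-per-tick factor assumed here:

```c
#include <stdint.h>

struct pvclock_sample {
    uint64_t tsc_timestamp;  /* host TSC when system_time was captured */
    uint64_t system_time;    /* nanoseconds at tsc_timestamp */
    uint64_t ns_per_tick;    /* simplified scale (real ABI: mul/shift) */
};

/* With one tuple shared by all vcpus and synchronized TSCs, two reads
 * with guest_tsc1 <= guest_tsc2 always yield time1 <= time2. */
uint64_t pvclock_read(const struct pvclock_sample *s, uint64_t guest_tsc)
{
    return s->system_time +
           (guest_tsc - s->tsc_timestamp) * s->ns_per_tick;
}
```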
14 Nov, 2012
2 commits
-
No need to check the return value before breaking out of the switch.
Signed-off-by: Guo Chao
Signed-off-by: Marcelo Tosatti -
We should avoid kfree()ing an error pointer in kvm_vcpu_ioctl() and
kvm_arch_vcpu_ioctl().

Signed-off-by: Guo Chao
Signed-off-by: Marcelo Tosatti
30 Oct, 2012
2 commits
-
This patch filters noslot pfns out of the error pfns, based on
Marcelo's comment: a noslot pfn is not an error pfn.

After this patch:
- is_noslot_pfn indicates that the gfn is not in any slot
- is_error_pfn indicates that the gfn is in a slot but an error
occurred when translating the gfn to a pfn
- is_error_noslot_pfn indicates that the pfn is either an error pfn
or a noslot pfn

And is_invalid_pfn can be removed, which makes the code cleaner.

Signed-off-by: Xiao Guangrong
Signed-off-by: Marcelo Tosatti -
Merge reason: development work has a dependency on kvm patches merged
upstream.

Conflicts:
arch/powerpc/include/asm/Kbuild
arch/powerpc/include/asm/kvm_para.h

Signed-off-by: Marcelo Tosatti
24 Oct, 2012
1 commit
-
Pull kvm fixes from Avi Kivity:
"KVM updates for 3.7-rc2"

* tag 'kvm-3.7-2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
KVM guest: exit idleness when handling KVM_PV_REASON_PAGE_NOT_PRESENT
KVM: apic: fix LDR calculation in x2apic mode
KVM: MMU: fix release noslot pfn
23 Oct, 2012
1 commit
-
We cannot directly call kvm_release_pfn_clean to release the pfn,
since we can meet a noslot pfn, which is used to cache mmio info in
the spte.

Signed-off-by: Xiao Guangrong
Cc: stable@vger.kernel.org
Signed-off-by: Avi Kivity
11 Oct, 2012
1 commit
-
Change the existing kernel error message to include the return value
from iommu_attach_device() when it fails. This will help debug device
assignment failures more effectively.

Signed-off-by: Shuah Khan
Signed-off-by: Marcelo Tosatti
06 Oct, 2012
1 commit
-
Now that we have defined generic set_bit_le() we do not need to use
test_and_set_bit_le() for atomically setting a bit.

Signed-off-by: Takuya Yoshikawa
Cc: Avi Kivity
Cc: Marcelo Tosatti
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds