14 Jan, 2011
3 commits
-
Cleanup some code with common compound_trans_head helper.
Signed-off-by: Andrea Arcangeli
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Marcelo Tosatti
Cc: Avi Kivity
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
For GRU and EPT, we need gup-fast to set the referenced bit too (this is why
it's correct to return 0 when shadow_access_mask is zero: it requires
gup-fast to set the referenced bit). qemu-kvm access already sets the
young bit in the pte if it isn't zero-copy; if it's zero-copy or a shadow
paging EPT minor fault we rely on gup-fast to signal the page is in
use...
We also need to check the young bits on the secondary pagetables for NPT
and not nested shadow mmu as the data may never get accessed again by the
primary pte.
Without this closer accuracy, we'd have to remove the heuristic that
avoids collapsing hugepages in hugepage virtual regions that have not even
a single subpage in use.
->test_young is fully backwards compatible with GRU and other usages that
don't have young bits in pagetables set by the hardware and that should
nuke the secondary mmu mappings when ->clear_flush_young runs just like
EPT does.
Removing the heuristic that checks the young bit in
khugepaged/collapse_huge_page completely isn't so bad either probably but
I thought it was worth it and this makes it reliable.
Signed-off-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This should work for both hugetlbfs and transparent hugepages.
[akpm@linux-foundation.org: bring forward PageTransCompound() addition for bisectability]
Signed-off-by: Andrea Arcangeli
Cc: Avi Kivity
Cc: Marcelo Tosatti
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Jan, 2011
28 commits
-
Since vmx blocks INIT signals, we disable virtualization extensions during
reboot. This leads to virtualization instructions faulting; we trap these
faults and spin while the reboot continues.
Unfortunately spinning on a non-preemptible kernel may block a task that
reboot depends on; this causes the reboot to hang.
Fix by skipping over the instruction and hoping for the best.
Signed-off-by: Avi Kivity
-
Quote from Avi:
| I don't think we need to flush immediately; set a "tlb dirty" bit somewhere
| that is cleared when we flush the tlb. kvm_mmu_notifier_invalidate_page()
| can consult the bit and force a flush if set.
Signed-off-by: Xiao Guangrong
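The deferred-flush idea quoted above can be illustrated with a minimal user-space sketch. The names vm_sketch, defer_flush() and invalidate_page_sketch() are hypothetical stand-ins invented for this illustration, not the kernel's structures or functions; the real code works on struct kvm and flushes remote TLBs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-VM MMU state; not the real struct kvm. */
struct vm_sketch {
    bool tlb_dirty;          /* set when a flush was deferred */
    unsigned long flushes;   /* counts simulated TLB flushes */
};

/* Simulated flush: the dirty bit is cleared whenever we flush. */
void flush_tlbs_sketch(struct vm_sketch *vm)
{
    vm->flushes++;
    vm->tlb_dirty = false;
}

/* A path that changed mappings but can tolerate a late flush. */
void defer_flush(struct vm_sketch *vm)
{
    vm->tlb_dirty = true;
}

/* Notifier-style path: flush if this call needs it or one is pending. */
void invalidate_page_sketch(struct vm_sketch *vm, bool need_flush)
{
    if (need_flush || vm->tlb_dirty)
        flush_tlbs_sketch(vm);
}
```

In this sketch a deferred flush costs nothing until the next invalidation, which then performs a single flush covering both.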
Signed-off-by: Marcelo Tosatti
-
Store irq routing table pointer in the irqfd object,
and use that to inject MSI directly without bouncing out to
a kernel thread.
While we touch this structure, rearrange irqfd fields to make fastpath
better packed for better cache utilization.
This also adds some comments about locking rules and rcu usage in code.
Some notes on the design:
- Use pointer into the rt instead of copying an entry,
to make it possible to use rcu, thus side-stepping
locking complexities. We also save some memory this way.
- Old workqueue code is still used for level irqs.
I don't think we DTRT with level anyway, however,
it seems easier to keep the code around as
it has been thought through and debugged, and fix level later than
rip out and re-instate it later.
Signed-off-by: Michael S. Tsirkin
Acked-by: Marcelo Tosatti
Acked-by: Gregory Haskins
Signed-off-by: Avi Kivity
-
The naming convention of the hardware_[dis|en]able family is a little bit confusing
because only hardware_[dis|en]able_all are using the _nolock suffix.
Renaming current hardware_[dis|en]able() to *_nolock() and using
hardware_[dis|en]able() as wrapper functions which take kvm_lock for them
reduces extra confusion.
Signed-off-by: Takuya Yoshikawa
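A small sketch of the naming scheme described above, assuming a toy lock that only records whether it is held. All identifiers ending in _sketch, and toy_lock itself, are hypothetical; the kernel uses a real spinlock named kvm_lock.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy lock standing in for kvm_lock; it only tracks the held state. */
struct toy_lock { bool held; };

struct toy_lock kvm_lock_sketch;
int enabled_count_sketch;

void toy_lock_acquire(struct toy_lock *l) { assert(!l->held); l->held = true; }
void toy_lock_release(struct toy_lock *l) { assert(l->held); l->held = false; }

/* _nolock variant: the caller must already hold the lock. */
void hardware_enable_nolock_sketch(void)
{
    assert(kvm_lock_sketch.held);
    enabled_count_sketch++;
}

/* Plain variant: takes the lock itself, then delegates. */
void hardware_enable_sketch(void)
{
    toy_lock_acquire(&kvm_lock_sketch);
    hardware_enable_nolock_sketch();
    toy_lock_release(&kvm_lock_sketch);
}
```

With this convention the suffix tells the reader who is responsible for locking: callers that already hold the lock use the _nolock form, everyone else uses the wrapper.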
Signed-off-by: Marcelo Tosatti
-
In kvm_cpu_hotplug(), only the CPU_STARTING case is protected by kvm_lock.
This patch adds the missing protection for the CPU_DYING case.
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Marcelo Tosatti
-
Any arch not supporting device assignment will also not build
assigned-dev.c. So testing for KVM_CAP_DEVICE_DEASSIGNMENT is pointless.
KVM_CAP_ASSIGN_DEV_IRQ is unconditionally set. Moreover, add a default
case for dispatching the ioctl.
Acked-by: Alex Williamson
Acked-by: Michael S. Tsirkin
Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
-
The guest may change states that pci_reset_function does not touch. So
we better save/restore the assigned device across guest usage.
Acked-by: Alex Williamson
Acked-by: Michael S. Tsirkin
Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
-
Cosmetic change, but it helps to correlate IRQs with PCI devices.
Acked-by: Alex Williamson
Acked-by: Michael S. Tsirkin
Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
-
This improves the IRQ forwarding for assigned devices: By using the
kernel's threaded IRQ scheme, we can get rid of the latency-prone work
queue and simplify the code in the same run.
Moreover, we no longer have to hold assigned_dev_lock while raising the
guest IRQ, which can be a lengthy operation as we may have to iterate
over all VCPUs. The lock is now only used for synchronizing masking vs.
unmasking of INTx-type IRQs, thus it is renamed to intx_lock.
Acked-by: Alex Williamson
Acked-by: Michael S. Tsirkin
Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
-
When we deassign a guest IRQ, clear the potentially asserted guest line.
There might be no chance for the guest to do this, specifically if we
switch from INTx to MSI mode.
Acked-by: Alex Williamson
Acked-by: Michael S. Tsirkin
Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
-
IA64 support forces us to abstract the allocation of the kvm structure.
But instead of mixing this up with arch-specific initialization and
doing the same on destruction, split both steps. This allows moving
generic destruction calls into generic code.
It also fixes error clean-up on failures of kvm_create_vm for IA64.
Signed-off-by: Jan Kiszka
Signed-off-by: Avi Kivity
-
Signed-off-by: Jan Kiszka
Signed-off-by: Avi Kivity
-
In kvm_async_pf_wakeup_all(), we add a dummy apf to vcpu->async_pf.done
without holding vcpu->async_pf.lock; it will break if we are handling apfs
at this time.
Also use 'list_empty_careful()' instead of 'list_empty()'.
Signed-off-by: Xiao Guangrong
Acked-by: Gleb Natapov
Signed-off-by: Marcelo Tosatti
-
If there is no need to inject an async #PF into a PV guest, we can handle
more completed apfs at one time, so we can retry the guest #PF
as early as possible.
Signed-off-by: Xiao Guangrong
Acked-by: Gleb Natapov
Signed-off-by: Marcelo Tosatti
-
Let's use newly introduced vzalloc().
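As a rough user-space analogy for this conversion, the sketch below contrasts "allocate, then clear by hand" with a single zeroing allocation. malloc(), memset() and calloc() stand in for the kernel's vmalloc(), memset() and vzalloc(); the two function names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Old pattern: allocate, then zero explicitly. */
void *alloc_then_clear_sketch(size_t size)
{
    void *p = malloc(size);

    if (p)
        memset(p, 0, size);
    return p;
}

/* New pattern: one call that hands back zeroed memory. */
void *alloc_zeroed_sketch(size_t size)
{
    return calloc(1, size);
}
```

Both return equivalent buffers; the second form is shorter and cannot forget the clearing step.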
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Jesper Juhl
Signed-off-by: Marcelo Tosatti
-
Fixes this:
CC arch/s390/kvm/../../../virt/kvm/kvm_main.o
arch/s390/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_dev_ioctl_create_vm':
arch/s390/kvm/../../../virt/kvm/kvm_main.c:1828:10: warning: unused variable 'r'
Signed-off-by: Heiko Carstens
Signed-off-by: Marcelo Tosatti
-
Fixes this:
CC arch/s390/kvm/../../../virt/kvm/kvm_main.o
arch/s390/kvm/../../../virt/kvm/kvm_main.c: In function 'kvm_clear_guest_page':
arch/s390/kvm/../../../virt/kvm/kvm_main.c:1224:2: warning: passing argument 3 of 'kvm_write_guest_page' makes pointer from integer without a cast
arch/s390/kvm/../../../virt/kvm/kvm_main.c:1185:5: note: expected 'const void *' but argument is of type 'long unsigned int'
Signed-off-by: Heiko Carstens
Signed-off-by: Marcelo Tosatti
-
Currently we are using vmalloc() for all dirty bitmaps even if
they are small enough, say less than K bytes.
We use kmalloc() if dirty bitmap size is less than or equal to
PAGE_SIZE so that we can avoid vmalloc area usage for VGA.
This will also make the logging start/stop faster.
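The size-based choice described above might look like the following user-space sketch. It assumes 4 KiB pages, uses calloc() for both paths, and only records which kernel allocator family would have been chosen; all identifiers are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define PAGE_SIZE_SKETCH 4096UL   /* assumption: 4 KiB pages */

enum alloc_kind { ALLOC_SMALL, ALLOC_LARGE };

struct dirty_bitmap_sketch {
    unsigned long *bits;
    enum alloc_kind kind;   /* remembered so the matching free is used */
};

/* Bytes needed for one bit per guest page, rounded up to whole longs. */
unsigned long bitmap_bytes(unsigned long npages)
{
    unsigned long bits_per_long = 8 * sizeof(unsigned long);

    return ((npages + bits_per_long - 1) / bits_per_long)
           * sizeof(unsigned long);
}

int bitmap_create(struct dirty_bitmap_sketch *bm, unsigned long npages)
{
    unsigned long bytes = bitmap_bytes(npages);

    /* Kernel code would use a kmalloc()-family call for ALLOC_SMALL
     * and a vmalloc()-family call for ALLOC_LARGE. */
    bm->kind = bytes <= PAGE_SIZE_SKETCH ? ALLOC_SMALL : ALLOC_LARGE;
    bm->bits = calloc(1, bytes);
    return bm->bits ? 0 : -1;
}

void bitmap_destroy(struct dirty_bitmap_sketch *bm)
{
    /* Kernel code would pick kfree() or vfree() from bm->kind. */
    free(bm->bits);
    bm->bits = NULL;
}
```

For example, a slot of 2048 pages (an illustrative figure for a small framebuffer) needs a 256-byte bitmap and takes the small path, while a slot of 262144 pages needs 32768 bytes and takes the large one.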
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Marcelo Tosatti
-
Currently x86's kvm_vm_ioctl_get_dirty_log() needs to allocate a bitmap by
vmalloc() which will be used in the next logging and this has been causing
bad effects for VGA and live migration: vmalloc() consumes extra systime,
triggers tlb flush, etc.
This patch resolves this issue by pre-allocating one more bitmap and switching
between two bitmaps during dirty logging.
Performance improvement:
I measured performance for the case of VGA update by trace-cmd.
The result was 1.5 times faster than the original one.
In the case of live migration, the improvement ratio depends on the workload
and the guest memory size. In general, the larger the memory size is the more
benefits we get.
Note:
This does not change other architectures' logic but the allocation size
becomes twice as large. This will increase the actual memory consumption only when
the new size changes the number of pages allocated by vmalloc().
Signed-off-by: Takuya Yoshikawa
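The two-bitmap scheme can be sketched in user space like this: one allocation of double size, with an "active" pointer that flips between the halves on every swap. The struct layout and function names are hypothetical and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

struct dirty_log_sketch {
    unsigned long *head;     /* both halves, allocated once */
    unsigned long *active;   /* half currently recording writes */
    size_t words;            /* length of one half, in longs */
};

int dirty_log_init(struct dirty_log_sketch *log, size_t words)
{
    log->head = calloc(2 * words, sizeof(unsigned long));
    if (!log->head)
        return -1;
    log->active = log->head;
    log->words = words;
    return 0;
}

/* Record that a page was written. */
void dirty_log_mark(struct dirty_log_sketch *log, size_t page)
{
    size_t bits_per_long = 8 * sizeof(unsigned long);

    log->active[page / bits_per_long] |= 1UL << (page % bits_per_long);
}

/* Hand back the filled half and start logging into the other one,
 * which is cleared first. No allocation happens on this path. */
const unsigned long *dirty_log_swap(struct dirty_log_sketch *log)
{
    unsigned long *filled = log->active;
    unsigned long *next = (filled == log->head)
                          ? log->head + log->words : log->head;

    memset(next, 0, log->words * sizeof(unsigned long));
    log->active = next;
    return filled;
}

void dirty_log_free(struct dirty_log_sketch *log)
{
    free(log->head);
    log->head = log->active = NULL;
}
```

The snapshot returned by dirty_log_swap() stays valid until the following swap, which is what lets the reader copy it out while new writes are logged into the other half.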
Signed-off-by: Fernando Luis Vazquez Cao
Signed-off-by: Marcelo Tosatti
-
This makes it easy to change the way of allocating/freeing dirty bitmaps.
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Fernando Luis Vazquez Cao
Signed-off-by: Marcelo Tosatti
-
Add tracepoint for userspace exit.
Signed-off-by: Gleb Natapov
Signed-off-by: Marcelo Tosatti
-
As suggested by Andrea, pass r/w error code to gup(), upgrading read fault
to writable if host pte allows it.
Signed-off-by: Marcelo Tosatti
Signed-off-by: Avi Kivity
-
Improve vma handling code readability in hva_to_pfn() and fix
async pf handling code to properly check vma returned by find_vma().
Signed-off-by: Gleb Natapov
Signed-off-by: Marcelo Tosatti
-
Send async page fault to a PV guest if it accesses swapped out memory.
Guest will choose another task to run upon receiving the fault.
Allow async page fault injection only when guest is in user mode since
otherwise guest may be in non-sleepable context and will not be able
to reschedule.
Vcpu will be halted if guest faults on the same page again or if
vcpu executes kernel code.
Acked-by: Rik van Riel
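The policy stated in this message reduces to a small decision function, sketched below. The enum and function are hypothetical and ignore other conditions the real code may check (for instance whether the guest enabled the feature at all).

```c
#include <assert.h>
#include <stdbool.h>

enum apf_action { APF_INJECT, APF_HALT };

/*
 * Inject an async page fault only when the guest is in user mode and is
 * not already waiting on this page; otherwise halt the vcpu until the
 * page is available.
 */
enum apf_action apf_decide(bool guest_user_mode, bool already_waiting)
{
    if (guest_user_mode && !already_waiting)
        return APF_INJECT;
    return APF_HALT;
}
```

The kernel-mode case falls back to halting because, as the message notes, the guest may be in a context where it cannot reschedule.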
Signed-off-by: Gleb Natapov
Signed-off-by: Marcelo Tosatti
-
Guest enables async PF vcpu functionality using this MSR.
Reviewed-by: Rik van Riel
Signed-off-by: Gleb Natapov
Signed-off-by: Marcelo Tosatti
-
Keep track of memslots changes by keeping a generation number in the memslots
structure. Provide kvm_write_guest_cached() function that skips
gfn_to_hva() translation if memslots were not changed since the previous
invocation.
Acked-by: Rik van Riel
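A generation-checked translation cache of the kind described above can be sketched as follows. The structures, the toy "gpa plus offset" translation and the lookup counter are all hypothetical; only the idea of comparing a stored generation number against the current one comes from the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy memslots: translation is just gpa + offset. */
struct memslots_sketch {
    uint64_t generation;   /* bumped on every layout change */
    uint64_t offset;
};

struct hva_cache_sketch {
    uint64_t generation;   /* generation the cached value belongs to */
    uint64_t gpa;
    uint64_t hva;
    bool valid;
};

unsigned long slow_lookups_sketch;   /* counts full translations */

uint64_t slow_translate(const struct memslots_sketch *s, uint64_t gpa)
{
    slow_lookups_sketch++;
    return gpa + s->offset;
}

/* Reuse the cached result unless the memslots generation moved on. */
uint64_t cached_translate(const struct memslots_sketch *s,
                          struct hva_cache_sketch *c, uint64_t gpa)
{
    if (!c->valid || c->generation != s->generation || c->gpa != gpa) {
        c->hva = slow_translate(s, gpa);
        c->gpa = gpa;
        c->generation = s->generation;
        c->valid = true;
    }
    return c->hva;
}

/* Any layout change must bump the generation to invalidate caches. */
void memslots_update(struct memslots_sketch *s, uint64_t new_offset)
{
    s->offset = new_offset;
    s->generation++;
}
```

The cache never needs to be told about updates; a stale entry is detected lazily on the next use by the generation mismatch.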
Signed-off-by: Gleb Natapov
Signed-off-by: Marcelo Tosatti
-
When a page is swapped in it is mapped into guest memory only after the guest
tries to access it again and generates another fault. To save this fault
we can map it immediately since we know that the guest is going to access
the page. Do it only when tdp is enabled for now. Shadow paging case is
more complicated. CR[034] and EFER registers should be switched before
doing mapping and then switched back.
Acked-by: Rik van Riel
Signed-off-by: Gleb Natapov
Signed-off-by: Marcelo Tosatti
-
If a guest accesses swapped out memory do not swap it in from vcpu thread
context. Schedule work to do swapping and put vcpu into halted state
instead.
Interrupts will still be delivered to the guest and if an interrupt
causes a reschedule the guest will continue to run another task.
[avi: remove call to get_user_pages_noio(), nacked by Linus; this
makes everything synchronous again]
Acked-by: Rik van Riel
Signed-off-by: Gleb Natapov
Signed-off-by: Marcelo Tosatti
25 Oct, 2010
1 commit
-
* 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (321 commits)
KVM: Drop CONFIG_DMAR dependency around kvm_iommu_map_pages
KVM: Fix signature of kvm_iommu_map_pages stub
KVM: MCE: Send SRAR SIGBUS directly
KVM: MCE: Add MCG_SER_P into KVM_MCE_CAP_SUPPORTED
KVM: fix typo in copyright notice
KVM: Disable interrupts around get_kernel_ns()
KVM: MMU: Avoid sign extension in mmu_alloc_direct_roots() pae root address
KVM: MMU: move access code parsing to FNAME(walk_addr) function
KVM: MMU: audit: check whether have unsync sps after root sync
KVM: MMU: audit: introduce audit_printk to cleanup audit code
KVM: MMU: audit: unregister audit tracepoints before module unloaded
KVM: MMU: audit: fix vcpu's spte walking
KVM: MMU: set access bit for direct mapping
KVM: MMU: cleanup for error mask set while walk guest page table
KVM: MMU: update 'root_hpa' out of loop in PAE shadow path
KVM: x86 emulator: Eliminate compilation warning in x86_decode_insn()
KVM: x86: Fix constant type in kvm_get_time_scale
KVM: VMX: Add AX to list of registers clobbered by guest switch
KVM guest: Move a printk that's using the clock before it's ready
KVM: x86: TSC catchup mode
...
24 Oct, 2010
7 commits
-
We also have to call kvm_iommu_map_pages for CONFIG_AMD_IOMMU. So drop
the dependency on Intel IOMMU, kvm_iommu_map_pages will be a nop anyway
if CONFIG_IOMMU_API is not defined.
KVM-Stable-Tag.
Signed-off-by: Jan Kiszka
Signed-off-by: Marcelo Tosatti
-
Fix typo in copyright notice.
Signed-off-by: Nicolas Kaiser
Signed-off-by: Marcelo Tosatti
-
It doesn't really matter, but if we spin, we should spin in a more relaxed
manner. This way, if something goes wrong at least it won't contribute to
global warming.
Signed-off-by: Avi Kivity
Signed-off-by: Marcelo Tosatti
-
There is a bug in this function: we call gfn_to_pfn() and kvm_mmu_gva_to_gpa_read() in
atomic context (kvm_mmu_audit() is called under the protection of the mmu_lock spinlock).
This patch fixes it by:
- introducing gfn_to_pfn_atomic instead of gfn_to_pfn
- getting the mapping gfn from kvm_mmu_page_get_gfn()
And it adds a 'notrap' ptes check in unsync/direct sps.
Signed-off-by: Xiao Guangrong
Signed-off-by: Avi Kivity
-
Introduce this function to get consecutive gfn's pages; it can reduce
gup's overhead, and is used by a later patch.
Signed-off-by: Xiao Guangrong
Signed-off-by: Marcelo Tosatti
-
Introduce hva_to_pfn_atomic(); it's the fast path and can be used in atomic
context. A later patch will use it.
Signed-off-by: Xiao Guangrong
Signed-off-by: Marcelo Tosatti
-
If there are active VCPUs which are marked as belonging to
a particular hardware CPU, request a clock sync for them when
enabling hardware; the TSC could be desynchronized on a newly
arriving CPU, and we need to recompute guests' system time
relative to boot after a suspend event.
This covers both cases.
Note that it is acceptable to take the spinlock, as either
no other tasks will be running and no locks held (BSP after
resume), or other tasks will be guaranteed to drop the lock
relatively quickly (AP on CPU_STARTING).
Noting we now get clock synchronization requests for VCPUs
which are starting up (or restarting), it is tempting to
attempt to remove the arch/x86/kvm/x86.c CPU hot-notifiers
at this time, however it is not correct to do so; they are
required for systems with non-constant TSC as the frequency
may not be known immediately after the processor has started
until the cpufreq driver has had a chance to run and query
the chipset.
Updated: implement better locking semantics for hardware_enable
Removed the hack of dropping and retaking the lock by adding the
semantic that we always hold kvm_lock when hardware_enable is
called. The one place that doesn't need to worry about it is
resume, as resuming a frozen CPU, the spinlock won't be taken.
Signed-off-by: Zachary Amsden
Signed-off-by: Marcelo Tosatti
23 Oct, 2010
1 commit
-
* 'llseek' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/bkl:
vfs: make no_llseek the default
vfs: don't use BKL in default_llseek
llseek: automatically add .llseek fop
libfs: use generic_file_llseek for simple_attr
mac80211: disallow seeks in minstrel debug code
lirc: make chardev nonseekable
viotape: use noop_llseek
raw: use explicit llseek file operations
ibmasmfs: use generic_file_llseek
spufs: use llseek in all file operations
arm/omap: use generic_file_llseek in iommu_debug
lkdtm: use generic_file_llseek in debugfs
net/wireless: use generic_file_llseek in debugfs
drm: use noop_llseek