21 Nov, 2013
1 commit
-
Using the address of 'empty_zero_page' as source address in order to
clear a page is wrong. On some architectures empty_zero_page is only the
pointer to the struct page of the empty_zero_page. Therefore the clear
page operation would copy the contents of a couple of struct pages instead
of clearing a page. For kvm only arm/arm64 are affected by this bug.

To fix this use the ZERO_PAGE macro instead which will return the struct
page address of the empty_zero_page on all architectures.

Signed-off-by: Heiko Carstens
Signed-off-by: Gleb Natapov
15 Nov, 2013
1 commit
-
Pull KVM changes from Paolo Bonzini:
"Here are the 3.13 KVM changes. There was a lot of work on the PPC
side: the HV and emulation flavors can now coexist in a single kernel
is probably the most interesting change from a user point of view.

On the x86 side there are nested virtualization improvements and a few
bugfixes.

ARM got transparent huge page support, improved overcommit, and
support for big endian guests.

Finally, there is a new interface to connect KVM with VFIO. This
helps with devices that use NoSnoop PCI transactions, letting the
driver in the guest execute WBINVD instructions. This includes some
nVidia cards on Windows, that fail to start without these patches and
the corresponding userspace changes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (146 commits)
kvm, vmx: Fix lazy FPU on nested guest
arm/arm64: KVM: PSCI: propagate caller endianness to the incoming vcpu
arm/arm64: KVM: MMIO support for BE guest
kvm, cpuid: Fix sparse warning
kvm: Delete prototype for non-existent function kvm_check_iopl
kvm: Delete prototype for non-existent function complete_pio
hung_task: add method to reset detector
pvclock: detect watchdog reset at pvclock read
kvm: optimize out smp_mb after srcu_read_unlock
srcu: API for barrier after srcu read unlock
KVM: remove vm mmap method
KVM: IOMMU: hva align mapping page size
KVM: x86: trace cpuid emulation when called from emulator
KVM: emulator: cleanup decode_register_operand() a bit
KVM: emulator: check rex prefix inside decode_register()
KVM: x86: fix emulation of "movzbl %bpl, %eax"
kvm_host: typo fix
KVM: x86: emulate SAHF instruction
MAINTAINERS: add tree for kvm.git
Documentation/kvm: add a 00-INDEX file
...
06 Nov, 2013
1 commit
-
It was used in conjunction with KVM_SET_MEMORY_REGION ioctl which was
removed by b74a07beed0 in 2010, QEMU stopped using it in 2008, so
it is time to remove the code finally.

Signed-off-by: Gleb Natapov
05 Nov, 2013
1 commit
-
When determining the page size we could use to map with the IOMMU, the
page size should also be aligned with the hva, not just the gfn. The
gfn may not reflect the real alignment within the hugetlbfs file.

Most of the time, this works fine. However, if the hugetlbfs file is
backed by non-contiguous huge pages, a multi-huge page memslot starts at
an unaligned offset within the hugetlbfs file, and the gfn is aligned
with respect to the huge page size, kvm_host_page_size() will return the
huge page size and we will use that to map with the IOMMU.

When we later unpin that same memslot, the IOMMU returns the unmap size
as the huge page size, and we happily unpin that many pfns in
monotonically increasing order, not realizing we are spanning
non-contiguous huge pages and partially unpin the wrong huge page.

Ensure the IOMMU mapping page size is aligned with the hva corresponding
to the gfn, which does reflect the alignment within the hugetlbfs file.

Reviewed-by: Marcelo Tosatti
Signed-off-by: Greg Edwards
Cc: stable@vger.kernel.org
Signed-off-by: Gleb Natapov
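
The alignment rule described in the commit above can be sketched in userspace C. This is only an illustration under stated assumptions, not the kernel code: the function name `sketch_iommu_map_size`, the `SKETCH_` constants, and the 4 KiB base page size are all hypothetical. The idea is to start from the host page size and halve it until both the guest physical address and the hva are aligned to it.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12  /* assumption: 4 KiB base pages */
#define SKETCH_PAGE_SIZE  (UINT64_C(1) << SKETCH_PAGE_SHIFT)

/*
 * Shrink a candidate mapping size until both the guest frame number (gfn)
 * and the host virtual address (hva) are aligned to it.
 * host_page_size must be a power of two and at least SKETCH_PAGE_SIZE.
 */
static uint64_t sketch_iommu_map_size(uint64_t gfn, uint64_t hva,
                                      uint64_t host_page_size)
{
    uint64_t size = host_page_size;

    assert(size >= SKETCH_PAGE_SIZE && (size & (size - 1)) == 0);

    /* The guest physical address must be aligned to the mapping size. */
    while ((gfn << SKETCH_PAGE_SHIFT) & (size - 1))
        size >>= 1;

    /* The additional check this fix describes: the hva must be aligned too. */
    while (hva & (size - 1))
        size >>= 1;

    return size;
}
```

With a 2 MiB host page size, a 2 MiB-aligned gfn paired with an hva that is offset by one 4 KiB page falls back to a 4 KiB mapping, which is the case the gfn-only check would have missed.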
04 Nov, 2013
1 commit
-
Conflicts:
arch/powerpc/include/asm/processor.h
31 Oct, 2013
3 commits
-
We currently use some ad-hoc arch variables tied to legacy KVM device
assignment to manage emulation of instructions that depend on whether
non-coherent DMA is present. Create an interface for this, adapting
legacy KVM device assignment and adding VFIO via the KVM-VFIO device.
For now we assume that non-coherent DMA is possible any time we have a
VFIO group. Eventually an interface can be developed as part of the
VFIO external user interface to query the coherency of a group.

Signed-off-by: Alex Williamson
Signed-off-by: Paolo Bonzini
-
Default to operating in coherent mode. This simplifies the logic when
we switch to a model of registering and unregistering noncoherent I/O
with KVM.

Signed-off-by: Alex Williamson
Signed-off-by: Paolo Bonzini
-
So far we've succeeded at making KVM and VFIO mostly unaware of each
other, but areas are cropping up where a connection beyond eventfds
and irqfds needs to be made. This patch introduces a KVM-VFIO device
that is meant to be a gateway for such interaction. The user creates
the device and can add and remove VFIO groups to it via file
descriptors. When a group is added, KVM verifies the group is valid
and gets a reference to it via the VFIO external user interface.

Signed-off-by: Alex Williamson
Signed-off-by: Paolo Bonzini
30 Oct, 2013
1 commit
-
I don't know if this was due to cut and paste, or somebody was really
using a D20 to pick the error code for kvm_init_debugfs as suggested by
Linus (EFAULT is 14, so the possibility cannot be entirely ruled out).

In any case, this patch fixes it.

Reported-by: Tim Gardner
Signed-off-by: Paolo Bonzini
28 Oct, 2013
1 commit
-
In kvm_iommu_map_pages(), we need to know the page size, which we get by
calling kvm_host_page_size(). That function checks whether the target slot
is valid before returning the right page size.
Currently, we map the iommu pages when creating a new slot.
But we call kvm_iommu_map_pages() while preparing the new slot.
At that time, the new slot is not yet visible to the domain (it is still
being prepared).
So we cannot get the right page size from kvm_host_page_size() and
this breaks the IOMMU super page logic.
The solution is to map the iommu pages after we insert the new slot
into the domain.

Signed-off-by: Yang Zhang
Tested-by: Patrick Lu
Signed-off-by: Paolo Bonzini
17 Oct, 2013
3 commits
-
Merging master into next to satisfy the dependencies.
Conflicts:
arch/arm/kvm/reset.c
-
We will use that in the later patch to find the kvm ops handler
Signed-off-by: Aneesh Kumar K.V
Signed-off-by: Alexander Graf
-
Signed-off-by: Aneesh Kumar K.V
[agraf: squash in compile fix]
Signed-off-by: Alexander Graf
15 Oct, 2013
1 commit
-
Page pinning is not mandatory in kvm async page fault processing since
after async page fault event is delivered to a guest it accesses page once
again and does its own GUP. Drop the FOLL_GET flag in GUP in async_pf
code, and do some simplifying in check/clear processing.

Suggested-by: Gleb Natapov
Signed-off-by: Gu zheng
Signed-off-by: chai wen
Signed-off-by: Gleb Natapov
03 Oct, 2013
2 commits
-
When KVM (de)assigns PCI(e) devices to VMs, a debug message is printed
including the BDF notation of the respective device. Currently, the BDF
notation does not have the commonly used leading zeros. This produces
messages like "assign device 0:1:8.0", which look strange at first sight.

The patch fixes this by exchanging the printk(KERN_DEBUG ...) with dev_info()
and also inserts "kvm" into the debug message, so that it is obvious where
the message comes from. Also reduces LoC.

Acked-by: Alex Williamson
Signed-off-by: Andre Richter
Signed-off-by: Gleb Natapov
-
gfn_to_memslot() can return NULL or invalid slot. We need to check slot
validity before accessing it.

Reviewed-by: Paolo Bonzini
Signed-off-by: Gleb Natapov
30 Sep, 2013
3 commits
-
In commit e935b8372cf8 ("KVM: Convert kvm_lock to raw_spinlock"),
the kvm_lock was made a raw lock. However, the kvm mmu_shrink()
function tries to grab the (non-raw) mmu_lock within the scope of
the raw locked kvm_lock being held. This leads to the following:

BUG: sleeping function called from invalid context at kernel/rtmutex.c:659
in_atomic(): 1, irqs_disabled(): 0, pid: 55, name: kswapd0
Preemption disabled at:[] mmu_shrink+0x5c/0x1b0 [kvm]

Pid: 55, comm: kswapd0 Not tainted 3.4.34_preempt-rt
Call Trace:
[] __might_sleep+0xfd/0x160
[] rt_spin_lock+0x24/0x50
[] mmu_shrink+0xec/0x1b0 [kvm]
[] shrink_slab+0x17d/0x3a0
[] ? mem_cgroup_iter+0x130/0x260
[] balance_pgdat+0x54a/0x730
[] ? set_pgdat_percpu_threshold+0xa7/0xd0
[] kswapd+0x18f/0x490
[] ? get_parent_ip+0x11/0x50
[] ? __init_waitqueue_head+0x50/0x50
[] ? balance_pgdat+0x730/0x730
[] kthread+0xdb/0xe0
[] ? finish_task_switch+0x52/0x100
[] kernel_thread_helper+0x4/0x10
[] ? __init_kthread_worker+0x

After the previous patch, kvm_lock need not be a raw spinlock anymore,
so change it back.

Reported-by: Paul Gortmaker
Cc: kvm@vger.kernel.org
Cc: gleb@redhat.com
Cc: jan.kiszka@siemens.com
Reviewed-by: Gleb Natapov
Signed-off-by: Paolo Bonzini
-
The VM list need not be protected by a raw spinlock. Separate the
two so that kvm_lock can be made non-raw.

Cc: kvm@vger.kernel.org
Cc: gleb@redhat.com
Cc: jan.kiszka@siemens.com
Reviewed-by: Gleb Natapov
Signed-off-by: Paolo Bonzini
-
Remove the useless argument, and do not do anything if there are no
VMs running at the time of the hotplug.

Cc: kvm@vger.kernel.org
Cc: gleb@redhat.com
Cc: jan.kiszka@siemens.com
Reviewed-by: Gleb Natapov
Signed-off-by: Paolo Bonzini
25 Sep, 2013
1 commit
-
'.done' is used to mark the completion of 'async_pf_execute()', but
'cancel_work_sync()' returns true when the work was canceled, so we
use it instead.

Signed-off-by: Radim Krčmář
Reviewed-by: Paolo Bonzini
Reviewed-by: Gleb Natapov
Signed-off-by: Paolo Bonzini
17 Sep, 2013
2 commits
-
When we cancel 'async_pf_execute()', we should behave as if the work was
never scheduled in 'kvm_setup_async_pf()'.
Fixes a bug when we can't unload module because the vm wasn't destroyed.

Signed-off-by: Radim Krčmář
Reviewed-by: Paolo Bonzini
Reviewed-by: Gleb Natapov
Signed-off-by: Paolo Bonzini
-
Page tables in a read-only memory slot will currently cause a triple
fault because the page walker uses gfn_to_hva and it fails on such a slot.

OVMF uses such a page table; however, real hardware seems to be fine with
that as long as the accessed/dirty bits are set. Save whether the slot
is readonly, and later check it when updating the accessed and dirty bits.

Reviewed-by: Xiao Guangrong
Reviewed-by: Gleb Natapov
Signed-off-by: Paolo Bonzini
05 Sep, 2013
1 commit
-
Pull vfs pile 1 from Al Viro:
"Unfortunately, this merge window it'll have to be a lot of small piles -
my fault, actually, for not keeping #for-next in anything that would
resemble a sane shape ;-/

This pile: assorted fixes (the first 3 are -stable fodder, IMO) and
cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last
components) + several long-standing patches from various folks.

There definitely will be a lot more (starting with Miklos'
check_submount_and_drop() series)"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits)
direct-io: Handle O_(D)SYNC AIO
direct-io: Implement generic deferred AIO completions
add formats for dentry/file pathnames
kvm eventfd: switch to fdget
powerpc kvm: use fdget
switch fchmod() to fdget
switch epoll_ctl() to fdget
switch copy_module_from_fd() to fdget
git simplify nilfs check for busy subtree
ibmasmfs: don't bother passing superblock when not needed
don't pass superblock to hypfs_{mkdir,create*}
don't pass superblock to hypfs_diag_create_files
don't pass superblock to hypfs_vm_create_files()
oprofile: get rid of pointless forward declarations of struct super_block
oprofilefs_create_...() do not need superblock argument
oprofilefs_mkdir() doesn't need superblock argument
don't bother with passing superblock to oprofile_create_stats_files()
oprofile: don't bother with passing superblock to ->create_files()
don't bother passing sb to oprofile_create_files()
coh901318: don't open-code simple_read_from_buffer()
...
04 Sep, 2013
1 commit
-
Signed-off-by: Al Viro
30 Aug, 2013
3 commits
-
For bytemaps each IRQ field is 1 byte wide, so we pack 4 irq fields in
one word and since there are 32 private (per cpu) irqs, we have 8
private u32 fields on the vgic_bytemap struct. We shift the offset from
the base of the register group right by 2, giving us the word index
instead of the field index. But then there are 8 private words, not 4,
which is also why we subtract 8 words from the offset of the shared
words.

Signed-off-by: Christoffer Dall
Signed-off-by: Marc Zyngier
Signed-off-by: Gleb Natapov
-
All the code in handle_mmio_cfg_reg() assumes the offset has
been shifted right to accommodate for the 2:1 bit compression,
but this is only done when getting the register address.

Shift the offset early so the code works mostly unchanged.

Reported-by: Zhaobo (Bob, ERC)
Signed-off-by: Marc Zyngier
Signed-off-by: Gleb Natapov
-
vgic_get_target_reg is quite complicated, for no good reason.
Actually, it is fairly easy to write it in a much more efficient
way by using the target CPU array instead of the bitmap.

Signed-off-by: Marc Zyngier
Signed-off-by: Gleb Natapov
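
To illustrate why a per-interrupt target CPU array makes this register easy to build, here is a minimal userspace sketch. It is not the kernel's vgic_get_target_reg(): the name `sketch_get_target_reg`, the array layout, and the register layout (four interrupts per 32-bit register, one byte per interrupt, one bit per CPU in each byte) are assumptions made for the example.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_IRQS_PER_REG 4  /* assumption: one register covers 4 interrupts */

/*
 * Build one 32-bit target register from a per-interrupt target CPU array.
 * target_cpu[n] holds the single CPU number (0..7) that interrupt n is
 * routed to. Each interrupt owns one byte of the register, and that byte
 * is a bitmask of CPUs.
 */
static uint32_t sketch_get_target_reg(const uint8_t *target_cpu, int first_irq)
{
    uint32_t val = 0;

    for (int i = 0; i < SKETCH_IRQS_PER_REG; i++) {
        uint8_t cpu = target_cpu[first_irq + i];

        assert(cpu < 8);
        /* Set bit "cpu" inside byte "i" of the register. */
        val |= UINT32_C(1) << (cpu + i * 8);
    }
    return val;
}
```

Each field costs one array read and one shift, with no scan over a per-CPU bitmap.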
28 Aug, 2013
1 commit
-
This is the type-safe comparison function, so the double-underscore is
not related.

Signed-off-by: Paolo Bonzini
Signed-off-by: Gleb Natapov
27 Aug, 2013
1 commit
-
The checks on PG_reserved in the page structure on head and tail pages
aren't necessary because split_huge_page wouldn't transfer the
PG_reserved bit from head to tail anyway.

This was a forward-thinking check done in the case PageReserved was
set by a driver-owned page mapped in userland with something like
remap_pfn_range in a VM_PFNMAP region, but using hugepmds (not
possible right now). It was meant to be very safe, but it's overkill
as it's unlikely split_huge_page could ever run without the driver
noticing and tearing down the hugepage itself.

And if a driver in the future will really want to map a reserved
hugepage in userland using a huge pmd it should simply take care of
marking all subpages reserved too to keep KVM safe. This of course
would require such a hypothetical driver to tear down the huge pmd
itself and split the hugepage itself, instead of relying on
split_huge_page, but that sounds very reasonable, especially
considering split_huge_page wouldn't currently transfer the reserved
bit anyway.

Signed-off-by: Andrea Arcangeli
Signed-off-by: Gleb Natapov
26 Aug, 2013
1 commit
-
KVM uses anon_inode_getfd() to allocate file descriptors as part
of some of its ioctls. But those ioctls are lacking a flag argument
allowing userspace to choose options for the newly opened file descriptor.

In such case it's advised to use O_CLOEXEC by default so that
userspace is allowed to choose, without race, if the file descriptor
is going to be inherited across exec().

This patch sets the O_CLOEXEC flag on all file descriptors created
with anon_inode_getfd() to not leak file descriptors across exec().

Signed-off-by: Yann Droneaud
Link: http://lkml.kernel.org/r/cover.1377372576.git.ydroneaud@opteya.com
Reviewed-by: Paolo Bonzini
Signed-off-by: Gleb Natapov
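
The "default to close-on-exec, let userspace opt out" reasoning above can be shown from the userspace side with plain POSIX calls. This sketch does not touch KVM or anon_inode_getfd(); it uses open() on an ordinary path as a stand-in, and the `sketch_` helper names are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Open a descriptor with close-on-exec set atomically at creation time,
 * so there is no window in which a concurrent fork()+exec() could leak it.
 * Returns the descriptor, or -1 on error.
 */
static int sketch_open_cloexec(const char *path)
{
    assert(path != NULL);
    return open(path, O_RDONLY | O_CLOEXEC);
}

/* Return 1 if fd has FD_CLOEXEC set, 0 if not, -1 on error. */
static int sketch_is_cloexec(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return (flags & FD_CLOEXEC) ? 1 : 0;
}

/* A caller that does want inheritance across exec() clears the flag later. */
static int sketch_allow_inherit(int fd)
{
    int flags = fcntl(fd, F_GETFD);

    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags & ~FD_CLOEXEC);
}
```

Clearing the flag afterwards is race-free because the descriptor is never inheritable by accident; setting it afterwards would not be.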
29 Jul, 2013
1 commit
-
kvm_io_bus_sort_cmp is used also directly, not just as a callback for
sort and bsearch. In these cases, it is handy to have a type-safe
variant. This patch introduces such a variant, __kvm_io_bus_sort_cmp,
and uses it throughout kvm_main.c.

Signed-off-by: Paolo Bonzini
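
The pattern described above, a typed comparison function plus a thin `void *` wrapper for sort and bsearch, can be sketched in userspace C as follows. The struct, the function names, and the comparison rule (order by start address, treat a range contained in another as equal) are assumptions for illustration and are not claimed to match kvm_main.c.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

struct sketch_io_range {
    uint64_t addr;
    int len;
};

/* Type-safe variant: callers pass typed pointers, so no casts are needed. */
static int sketch_io_range_cmp(const struct sketch_io_range *r1,
                               const struct sketch_io_range *r2)
{
    if (r1->addr < r2->addr)
        return -1;
    if (r1->addr + r1->len > r2->addr + r2->len)
        return 1;
    return 0;  /* r1 lies within r2 */
}

/* void-pointer wrapper with the signature qsort() and bsearch() expect. */
static int sketch_io_range_sort_cmp(const void *p1, const void *p2)
{
    return sketch_io_range_cmp(p1, p2);
}

/* Find the sorted, non-overlapping range that contains [addr, addr + len). */
static const struct sketch_io_range *
sketch_find(const struct sketch_io_range *ranges, size_t n,
            uint64_t addr, int len)
{
    struct sketch_io_range key = { .addr = addr, .len = len };

    assert(ranges != NULL);
    return bsearch(&key, ranges, n, sizeof(*ranges),
                   sketch_io_range_sort_cmp);
}
```

Code that already holds typed pointers calls `sketch_io_range_cmp()` directly and gets compile-time type checking; only the library callbacks go through the wrapper.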
18 Jul, 2013
2 commits
-
This is called right after the memslots is updated, i.e. when the result
of update_memslots() gets installed in install_new_memslots(). Since
the memslots needs to be updated twice when we delete or move a memslot,
kvm_arch_commit_memory_region() does not correspond to this exactly.

In the following patch, x86 will use this new API to check if the mmio
generation has reached its maximum value, in which case mmio sptes need
to be flushed out.

Signed-off-by: Takuya Yoshikawa
Acked-by: Alexander Graf
Reviewed-by: Xiao Guangrong
Signed-off-by: Paolo Bonzini
-
Add new functions kvm_io_bus_{read,write}_cookie() that allow users of
the kvm io infrastructure to use a cookie value to speed up lookup of a
device on an io bus.

Signed-off-by: Cornelia Huck
Signed-off-by: Paolo Bonzini
Signed-off-by: Gleb Natapov
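
One way to picture the cookie idea: the caller remembers which slot served its last access and offers it as a hint on the next one. The sketch below is a userspace illustration under stated assumptions, not the kvm_io_bus_{read,write}_cookie() implementation. All names are hypothetical, the device is reduced to an integer id, and a linear scan stands in for whatever search the real code falls back to.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sketch_range {
    uint64_t addr;
    int len;
    int dev_id;  /* stand-in for a device pointer */
};

struct sketch_bus {
    int dev_count;
    struct sketch_range range[8];
};

/* Return 1 if [addr, addr + len) lies inside r. */
static int sketch_in_range(const struct sketch_range *r, uint64_t addr, int len)
{
    return addr >= r->addr && addr + len <= r->addr + r->len;
}

/*
 * Look up the slot for an access. The cookie is only a hint: it is
 * validated first, and on a miss we search and return the correct slot
 * index, which the caller can store as its next cookie.
 * Returns the slot index, or -1 if no device claims the access.
 */
static long sketch_bus_lookup_cookie(const struct sketch_bus *bus,
                                     uint64_t addr, int len, long cookie)
{
    assert(bus != NULL);

    /* Fast path: try the slot referenced by the cookie. */
    if (cookie >= 0 && cookie < bus->dev_count &&
        sketch_in_range(&bus->range[cookie], addr, len))
        return cookie;

    /* Slow path: the cookie was stale or garbage, so search. */
    for (long i = 0; i < bus->dev_count; i++)
        if (sketch_in_range(&bus->range[i], addr, len))
            return i;

    return -1;
}
```

Because the cookie is always validated against the range before use, a wrong or out-of-bounds cookie only costs the fallback search and never selects the wrong device.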
04 Jul, 2013
1 commit
-
Pull KVM fixes from Paolo Bonzini:
"On the x86 side, there are some optimizations and documentation
updates. The big ARM/KVM change for 3.11, support for AArch64, will
come through Catalin Marinas's tree. s390 and PPC have misc cleanups
and bugfixes"

* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (87 commits)
KVM: PPC: Ignore PIR writes
KVM: PPC: Book3S PR: Invalidate SLB entries properly
KVM: PPC: Book3S PR: Allow guest to use 1TB segments
KVM: PPC: Book3S PR: Don't keep scanning HPTEG after we find a match
KVM: PPC: Book3S PR: Fix invalidation of SLB entry 0 on guest entry
KVM: PPC: Book3S PR: Fix proto-VSID calculations
KVM: PPC: Guard doorbell exception with CONFIG_PPC_DOORBELL
KVM: Fix RTC interrupt coalescing tracking
kvm: Add a tracepoint write_tsc_offset
KVM: MMU: Inform users of mmio generation wraparound
KVM: MMU: document fast invalidate all mmio sptes
KVM: MMU: document fast invalidate all pages
KVM: MMU: document fast page fault
KVM: MMU: document mmio page fault
KVM: MMU: document write_flooding_count
KVM: MMU: document clear_spte_count
KVM: MMU: drop kvm_mmu_zap_mmio_sptes
KVM: MMU: init kvm generation close to mmio wrap-around value
KVM: MMU: add tracepoint for check_mmio_spte
KVM: MMU: fast invalidate all mmio sptes
...
27 Jun, 2013
3 commits
-
KVM/ARM pull request for 3.11 merge window
* tag 'kvm-arm-3.11' of git://git.linaro.org/people/cdall/linux-kvm-arm.git:
ARM: kvm: don't include drivers/virtio/Kconfig
Update MAINTAINERS: KVM/ARM work now funded by Linaro
arm/kvm: Cleanup KVM_ARM_MAX_VCPUS logic
ARM: KVM: clear exclusive monitor on all exception returns
ARM: KVM: add missing dsb before invalidating Stage-2 TLBs
ARM: KVM: perform save/restore of PAR
ARM: KVM: get rid of S2_PGD_SIZE
ARM: KVM: don't special case PC when doing an MMIO
ARM: KVM: use phys_addr_t instead of unsigned long long for HYP PGDs
ARM: KVM: remove dead prototype for __kvm_tlb_flush_vmid
ARM: KVM: Don't handle PSCI calls via SMC
ARM: KVM: Allow host virt timer irq to be different from guest timer virt irq
-
This reverts most of the f1ed0450a5fac7067590317cbf027f566b6ccbca. After
the commit kvm_apic_set_irq() no longer returns accurate information
about interrupt injection status if injection is done into disabled
APIC. RTC interrupt coalescing tracking relies on the information to be
accurate and cannot recover if it is not.

Signed-off-by: Gleb Natapov
-
The arch_timer irq numbers (or PPI numbers) are implementation dependent,
so the host virtual timer irq number can be different from guest virtual
timer irq number.

This patch ensures that host virtual timer irq number is read from DTB and
guest virtual timer irq is determined based on vcpu target type.

Signed-off-by: Anup Patel
Signed-off-by: Pranavkumar Sawargaonkar
Signed-off-by: Christoffer Dall
04 Jun, 2013
1 commit
-
We can easily reach the 1000 limit by starting a VM with a couple
hundred I/O devices (multifunction=on). The hardcoded limit has
already been adjusted 3 times (6 ~ 200 ~ 300 ~ 1000).

In userspace, we already have the maximum file descriptor limit to
cap the ioeventfd count. But kvm_io_bus devices are also used
for pit, pic, ioapic, coalesced_mmio. They cannot be limited
by the maximum file descriptor limit.

Currently only ioeventfds take up too many kvm_io_bus devices,
so just exclude them when counting against the kvm_io_range limit.

Also fixed one indent issue in kvm_host.h

Signed-off-by: Amos Kong
Reviewed-by: Stefan Hajnoczi
Signed-off-by: Gleb Natapov
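
The accounting change described above can be sketched as a limit check that subtracts the ioeventfd count before comparing. This is a userspace illustration only: the struct, the function name, and the choice of -ENOSPC as the error are assumptions. The value 1000 is the limit quoted in the commit message.

```c
#include <assert.h>
#include <errno.h>

#define SKETCH_NR_IOBUS_DEVS 1000  /* the limit quoted in the commit message */

struct sketch_io_bus {
    int dev_count;        /* all devices registered on the bus */
    int ioeventfd_count;  /* how many of those are ioeventfds */
};

/*
 * Register one more device. ioeventfds are excluded from the limit
 * check, so only the remaining devices count against it.
 * Returns 0 on success or -ENOSPC when the limit is reached.
 */
static int sketch_bus_register(struct sketch_io_bus *bus, int is_ioeventfd)
{
    assert(bus->ioeventfd_count <= bus->dev_count);

    if (bus->dev_count - bus->ioeventfd_count >= SKETCH_NR_IOBUS_DEVS)
        return -ENOSPC;

    bus->dev_count++;
    if (is_ioeventfd)
        bus->ioeventfd_count++;
    return 0;
}
```

In this sketch thousands of ioeventfds can be registered without using up the budget for the other device types, while the number of ioeventfds is still bounded by the process's file descriptor limit.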
19 May, 2013
1 commit
-
As KVM/arm64 is looming on the horizon, it makes sense to move some
of the common code to a single location in order to reduce duplication.

The code could live anywhere. Actually, most of KVM is already built
with a bunch of ugly ../../.. hacks in the various Makefiles, so we're
not exactly talking about style here. But maybe it is time to start
moving into a less ugly direction.

The include files must be in a "public" location, as they are accessed
from non-KVM files (arch/arm/kernel/asm-offsets.c).

For this purpose, introduce two new locations:
- virt/kvm/arm/ : x86 and ia64 already share the ioapic code in
  virt/kvm, so this could be seen as a (very ugly) precedent.
- include/kvm/ : there is already an include/xen, and while the
  intent is slightly different, this seems as good a location as
  any

Eventually, we should probably have independent Makefiles at every
level (just like everywhere else in the kernel), but this is just
the first step.

Signed-off-by: Marc Zyngier
Signed-off-by: Gleb Natapov
14 May, 2013
1 commit
-
Since the arrival of posted interrupt support we can no longer guarantee
that coalesced IRQs are always reported to the IRQ source. Moreover,
accumulated APIC timer events could cause a busy loop when a VCPU should
rather be halted. The consensus is to remove coalesced tracking from the
LAPIC.

Signed-off-by: Jan Kiszka
Acked-by: Marcelo Tosatti
Signed-off-by: Gleb Natapov