Eric Lee / smarc-fsl-linux-kernel

14 Jul, 2011

1 commit

c9aaa8957 KVM: Steal time implementation ... Browse Code »

To implement steal time, we need the hypervisor to pass the guest
information about how much time was spent running other processes
outside the VM, while the vcpu had meaningful work to do - halt
time does not count.

This information is acquired through the run_delay field of
delayacct/schedstats infrastructure, that counts time spent in a
runqueue but not running.

Steal time is a per-cpu information, so the traditional MSR-based
infrastructure is used. A new msr, KVM_MSR_STEAL_TIME, holds the
memory area address containing information about steal time

This patch contains the hypervisor part of the steal time infrasructure,
and can be backported independently of the guest portion.

[avi, yongjie: export delayacct_on, to avoid build failures in some configs]

Signed-off-by: Glauber Costa
Tested-by: Eric B Munson
CC: Rik van Riel
CC: Jeremy Fitzhardinge
CC: Peter Zijlstra
CC: Anthony Liguori
Signed-off-by: Yongjie Ren
Signed-off-by: Avi Kivity

Glauber Costa
2011-07-14 17:59:14 +0800

12 Jul, 2011

39 commits

9ddabbe72 KVM: KVM Steal time guest/host interface ... Browse Code »

To implement steal time, we need the hypervisor to pass the guest information
about how much time was spent running other processes outside the VM.
This is per-vcpu, and using the kvmclock structure for that is an abuse
we decided not to make.

In this patchset, I am introducing a new msr, KVM_MSR_STEAL_TIME, that
holds the memory area address containing information about steal time

This patch contains the headers for it. I am keeping it separate to facilitate
backports to people who wants to backport the kernel part but not the
hypervisor, or the other way around.

Signed-off-by: Glauber Costa
Acked-by: Rik van Riel
Tested-by: Eric B Munson
CC: Jeremy Fitzhardinge
CC: Peter Zijlstra
CC: Anthony Liguori
Signed-off-by: Avi Kivity

Glauber Costa
2011-07-12 18:17:03 +0800
4b6b35f55 KVM: Add constant to represent KVM MSRs enabled bit in guest/host interface ... Browse Code »

This patch is simple, put in a different commit so it can be more easily
shared between guest and hypervisor. It just defines a named constant
to indicate the enable bit for KVM-specific MSRs.

Signed-off-by: Glauber Costa
Acked-by: Rik van Riel
Tested-by: Eric B Munson
CC: Jeremy Fitzhardinge
CC: Peter Zijlstra
CC: Anthony Liguori
Signed-off-by: Avi Kivity

Glauber Costa
2011-07-12 18:17:02 +0800
e03b644fe KVM: introduce kvm_read_guest_cached ... Browse Code »

Introduce kvm_read_guest_cached() function in addition to write one we
already have.

[ by glauber: export function signature in kvm header ]

Signed-off-by: Gleb Natapov
Signed-off-by: Glauber Costa
Acked-by: Rik van Riel
Tested-by: Eric Munson
Signed-off-by: Avi Kivity

Gleb Natapov
2011-07-12 18:17:01 +0800
29d03158f KVM: PPC: Remove prog_flags ... Browse Code »

Commit c8f729d408 (KVM: PPC: Deliver program interrupts right away instead
of queueing them) made away with all users of prog_flags, so we can just
remove it from the headers.

Signed-off-by: Alexander Graf

Alexander Graf
2011-07-12 18:17:00 +0800
9e368f291 KVM: PPC: book3s_hv: Add support for PPC970-family processors ... Browse Code »

This adds support for running KVM guests in supervisor mode on those
PPC970 processors that have a usable hypervisor mode. Unfortunately,
Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
1), but the YDL PowerStation does have a usable hypervisor mode.

There are several differences between the PPC970 and POWER7 in how
guests are managed. These differences are accommodated using the
CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
bits. Notably, on PPC970:

* The LPCR, LPID or RMOR registers don't exist, and the functions of
those registers are provided by bits in HID4 and one bit in HID0.

* External interrupts can be directed to the hypervisor, but unlike
POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
SRR0/1 not HSRR0/1.

* There is no virtual RMA (VRMA) mode; the guest must use an RMO
(real mode offset) area.

* The TLB entries are not tagged with the LPID, so it is necessary to
flush the whole TLB on partition switch. Furthermore, when switching
partitions we have to ensure that no other CPU is executing the tlbie
or tlbsync instructions in either the old or the new partition,
otherwise undefined behaviour can occur.

* The PMU has 8 counters (PMC registers) rather than 6.

* The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.

* The SLB has 64 entries rather than 32.

* There is no mediated external interrupt facility, so if we switch to
a guest that has a virtual external interrupt pending but the guest
has MSR[EE] = 0, we have to arrange to have an interrupt pending for
it so that we can get control back once it re-enables interrupts. We
do that by sending ourselves an IPI with smp_send_reschedule after
hard-disabling interrupts.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:59 +0800
969391c58 powerpc, KVM: Split HVMODE_206 cpu feature bit into separate HV and architecture bits ... Browse Code »

This replaces the single CPU_FTR_HVMODE_206 bit with two bits, one to
indicate that we have a usable hypervisor mode, and another to indicate
that the processor conforms to PowerISA version 2.06. We also add
another bit to indicate that the processor conforms to ISA version 2.01
and set that for PPC970 and derivatives.

Some PPC970 chips (specifically those in Apple machines) have a
hypervisor mode in that MSR[HV] is always 1, but the hypervisor mode
is not useful in the sense that there is no way to run any code in
supervisor mode (HV=0 PR=0). On these processors, the LPES0 and LPES1
bits in HID4 are always 0, and we use that as a way of detecting that
hypervisor mode is not useful.

Where we have a feature section in assembly code around code that
only applies on POWER7 in hypervisor mode, we use a construct like

END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)

The definition of END_FTR_SECTION_IFSET is such that the code will
be enabled (not overwritten with nops) only if all bits in the
provided mask are set.

Note that the CPU feature check in __tlbie() only needs to check the
ARCH_206 bit, not the HVMODE bit, because __tlbie() can only get called
if we are running bare-metal, i.e. in hypervisor mode.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:58 +0800
aa04b4cc5 KVM: PPC: Allocate RMAs (Real Mode Areas) at boot for use by guests ... Browse Code »

This adds infrastructure which will be needed to allow book3s_hv KVM to
run on older POWER processors, including PPC970, which don't support
the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
Offset (RMO) facility. These processors require a physically
contiguous, aligned area of memory for each guest. When the guest does
an access in real mode (MMU off), the address is compared against a
limit value, and if it is lower, the address is ORed with an offset
value (from the Real Mode Offset Register (RMOR)) and the result becomes
the real address for the access. The size of the RMA has to be one of
a set of supported values, which usually includes 64MB, 128MB, 256MB
and some larger powers of 2.

Since we are unlikely to be able to allocate 64MB or more of physically
contiguous memory after the kernel has been running for a while, we
allocate a pool of RMAs at boot time using the bootmem allocator. The
size and number of the RMAs can be set using the kvm_rma_size=xx and
kvm_rma_count=xx kernel command line options.

KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
of the pool of preallocated RMAs. The capability value is 1 if the
processor can use an RMA but doesn't require one (because it supports
the VRMA facility), or 2 if the processor requires an RMA for each guest.

This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
pool and returns a file descriptor which can be used to map the RMA. It
also returns the size of the RMA in the argument structure.

Having an RMA means we will get multiple KMV_SET_USER_MEMORY_REGION
ioctl calls from userspace. To cope with this, we now preallocate the
kvm->arch.ram_pginfo array when the VM is created with a size sufficient
for up to 64GB of guest memory. Subsequently we will get rid of this
array and use memory associated with each memslot instead.

This moves most of the code that translates the user addresses into
host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
to kvmppc_core_prepare_memory_region. Also, instead of having to look
up the VMA for each page in order to check the page size, we now check
that the pages we get are compound pages of 16MB. However, if we are
adding memory that is mapped to an RMA, we don't bother with calling
get_user_pages_fast and instead just offset from the base pfn for the
RMA.

Typically the RMA gets added after vcpus are created, which makes it
inconvenient to have the LPCR (logical partition control register) value
in the vcpu->arch struct, since the LPCR controls whether the processor
uses RMA or VRMA for the guest. This moves the LPCR value into the
kvm->arch struct and arranges for the MER (mediated external request)
bit, which is the only bit that varies between vcpus, to be set in
assembly code when going into the guest if there is a pending external
interrupt request.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:57 +0800
371fefd6f KVM: PPC: Allow book3s_hv guests to use SMT processor modes ... Browse Code »

This lifts the restriction that book3s_hv guests can only run one
hardware thread per core, and allows them to use up to 4 threads
per core on POWER7. The host still has to run single-threaded.

This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
capability. The return value of the ioctl querying this capability
is the number of vcpus per virtual CPU core (vcore), currently 4.

To use this, the host kernel should be booted with all threads
active, and then all the secondary threads should be offlined.
This will put the secondary threads into nap mode. KVM will then
wake them from nap mode and use them for running guest code (while
they are still offline). To wake the secondary threads, we send
them an IPI using a new xics_wake_cpu() function, implemented in
arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
we assume that the platform has a XICS interrupt controller and
we are using icp-native.c to drive it. Since the woken thread will
need to acknowledge and clear the IPI, we also export the base
physical address of the XICS registers using kvmppc_set_xics_phys()
for use in the low-level KVM book3s code.

When a vcpu is created, it is assigned to a virtual CPU core.
The vcore number is obtained by dividing the vcpu number by the
number of threads per core in the host. This number is exported
to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
to run the guest in single-threaded mode, it should make all vcpu
numbers be multiples of the number of threads per core.

We distinguish three states of a vcpu: runnable (i.e., ready to execute
the guest), blocked (that is, idle), and busy in host. We currently
implement a policy that the vcore can run only when all its threads
are runnable or blocked. This way, if a vcpu needs to execute elsewhere
in the kernel or in qemu, it can do so without being starved of CPU
by the other vcpus.

When a vcore starts to run, it executes in the context of one of the
vcpu threads. The other vcpu threads all go to sleep and stay asleep
until something happens requiring the vcpu thread to return to qemu,
or to wake up to run the vcore (this can happen when another vcpu
thread goes from busy in host state to blocked).

It can happen that a vcpu goes from blocked to runnable state (e.g.
because of an interrupt), and the vcore it belongs to is already
running. In that case it can start to run immediately as long as
the none of the vcpus in the vcore have started to exit the guest.
We send the next free thread in the vcore an IPI to get it to start
to execute the guest. It synchronizes with the other threads via
the vcore->entry_exit_count field to make sure that it doesn't go
into the guest if the other vcpus are exiting by the time that it
is ready to actually enter the guest.

Note that there is no fixed relationship between the hardware thread
number and the vcpu number. Hardware threads are assigned to vcpus
as they become runnable, so we will always use the lower-numbered
hardware threads in preference to higher-numbered threads if not all
the vcpus in the vcore are runnable, regardless of which vcpus are
runnable.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:57 +0800
54738c097 KVM: PPC: Accelerate H_PUT_TCE by implementing it in real mode ... Browse Code »

This improves I/O performance for guests using the PAPR
paravirtualization interface by making the H_PUT_TCE hcall faster, by
implementing it in real mode. H_PUT_TCE is used for updating virtual
IOMMU tables, and is used both for virtual I/O and for real I/O in the
PAPR interface.

Since this moves the IOMMU tables into the kernel, we define a new
KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables. The
ioctl returns a file descriptor which can be used to mmap the newly
created table. The qemu driver models use them in the same way as
userspace managed tables, but they can be updated directly by the
guest with a real-mode H_PUT_TCE implementation, reducing the number
of host/guest context switches during guest IO.

There are certain circumstances where it is useful for userland qemu
to write to the TCE table even if the kernel H_PUT_TCE path is used
most of the time. Specifically, allowing this will avoid awkwardness
when we need to reset the table. More importantly, we will in the
future need to write the table in order to restore its state after a
checkpoint resume or migration.

Signed-off-by: David Gibson
Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

David Gibson
2011-07-12 18:16:56 +0800
a8606e20e KVM: PPC: Handle some PAPR hcalls in the kernel ... Browse Code »

This adds the infrastructure for handling PAPR hcalls in the kernel,
either early in the guest exit path while we are still in real mode,
or later once the MMU has been turned back on and we are in the full
kernel context. The advantage of handling hcalls in real mode if
possible is that we avoid two partition switches -- and this will
become more important when we support SMT4 guests, since a partition
switch means we have to pull all of the threads in the core out of
the guest. The disadvantage is that we can only access the kernel
linear mapping, not anything vmalloced or ioremapped, since the MMU
is off.

This also adds code to handle the following hcalls in real mode:

H_ENTER Add an HPTE to the hashed page table
H_REMOVE Remove an HPTE from the hashed page table
H_READ Read HPTEs from the hashed page table
H_PROTECT Change the protection bits in an HPTE
H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
H_SET_DABR Set the data address breakpoint register

Plus code to handle the following hcalls in the kernel:

H_CEDE Idle the vcpu until an interrupt or H_PROD hcall arrives
H_PROD Wake up a ceded vcpu
H_REGISTER_VPA Register a virtual processor area (VPA)

The code that runs in real mode has to be in the base kernel, not in
the module, if KVM is compiled as a module. The real-mode code can
only access the kernel linear mapping, not vmalloc or ioremap space.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:55 +0800
de56a948b KVM: PPC: Add support for Book3S processors in hypervisor mode ... Browse Code »

This adds support for KVM running on 64-bit Book 3S processors,
specifically POWER7, in hypervisor mode. Using hypervisor mode means
that the guest can use the processor's supervisor mode. That means
that the guest can execute privileged instructions and access privileged
registers itself without trapping to the host. This gives excellent
performance, but does mean that KVM cannot emulate a processor
architecture other than the one that the hardware implements.

This code assumes that the guest is running paravirtualized using the
PAPR (Power Architecture Platform Requirements) interface, which is the
interface that IBM's PowerVM hypervisor uses. That means that existing
Linux distributions that run on IBM pSeries machines will also run
under KVM without modification. In order to communicate the PAPR
hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
to include/linux/kvm.h.

Currently the choice between book3s_hv support and book3s_pr support
(i.e. the existing code, which runs the guest in user mode) has to be
made at kernel configuration time, so a given kernel binary can only
do one or the other.

This new book3s_hv code doesn't support MMIO emulation at present.
Since we are running paravirtualized guests, this isn't a serious
restriction.

With the guest running in supervisor mode, most exceptions go straight
to the guest. We will never get data or instruction storage or segment
interrupts, alignment interrupts, decrementer interrupts, program
interrupts, single-step interrupts, etc., coming to the hypervisor from
the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
exception entry path so that we don't have to do the KVM test on entry
to those exception handlers.

We do however get hypervisor decrementer, hypervisor data storage,
hypervisor instruction storage, and hypervisor emulation assist
interrupts, so we have to handle those.

In hypervisor mode, real-mode accesses can access all of RAM, not just
a limited amount. Therefore we put all the guest state in the vcpu.arch
and use the shadow_vcpu in the PACA only for temporary scratch space.
We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
We don't have a shared page with the guest, but we still need a
kvm_vcpu_arch_shared struct to store the values of various registers,
so we include one in the vcpu_arch struct.

The POWER7 processor has a restriction that all threads in a core have
to be in the same partition. MMU-on kernel code counts as a partition
(partition 0), so we have to do a partition switch on every entry to and
exit from the guest. At present we require the host and guest to run
in single-thread mode because of this hardware restriction.

This code allocates a hashed page table for the guest and initializes
it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
require that the guest memory is allocated using 16MB huge pages, in
order to simplify the low-level memory management. This also means that
we can get away without tracking paging activity in the host for now,
since huge pages can't be paged or swapped.

This also adds a few new exports needed by the book3s_hv code.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:54 +0800
3c42bf8a7 KVM: PPC: Split host-state fields out of kvmppc_book3s_shadow_vcpu ... Browse Code »

There are several fields in struct kvmppc_book3s_shadow_vcpu that
temporarily store bits of host state while a guest is running,
rather than anything relating to the particular guest or vcpu.
This splits them out into a new kvmppc_host_state structure and
modifies the definitions in asm-offsets.c to suit.

On 32-bit, we have a kvmppc_host_state structure inside the
kvmppc_book3s_shadow_vcpu since the assembly code needs to be able
to get to them both with one pointer. On 64-bit they are separate
fields in the PACA. This means that on 64-bit we don't need to
copy the kvmppc_host_state in and out on vcpu load/unload, and
in future will mean that the book3s_hv code doesn't need a
shadow_vcpu struct in the PACA at all. That does mean that we
have to be careful not to rely on any values persisting in the
hstate field of the paca across any point where we could block
or get preempted.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:53 +0800
923c53cae powerpc: Set up LPCR for running guest partitions ... Browse Code »

In hypervisor mode, the LPCR controls several aspects of guest
partitions, including virtual partition memory mode, and also controls
whether the hypervisor decrementer interrupts are enabled. This sets
up LPCR at boot time so that guest partitions will use a virtual real
memory area (VRMA) composed of 16MB large pages, and hypervisor
decrementer interrupts are disabled.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:52 +0800
df6909e5d KVM: PPC: Move guest enter/exit down into subarch-specific code ... Browse Code »

Instead of doing the kvm_guest_enter/exit() and local_irq_dis/enable()
calls in powerpc.c, this moves them down into the subarch-specific
book3s_pr.c and booke.c. This eliminates an extra local_irq_enable()
call in book3s_pr.c, and will be needed for when we do SMT4 guest
support in the book3s hypervisor mode code.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:51 +0800
f9e0554de KVM: PPC: Pass init/destroy vm and prepare/commit memory region ops down ... Browse Code »

This arranges for the top-level arch/powerpc/kvm/powerpc.c file to
pass down some of the calls it gets to the lower-level subarchitecture
specific code. The lower-level implementations (in booke.c and book3s.c)
are no-ops. The coming book3s_hv.c will need this.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:50 +0800
3cf658b60 KVM: PPC: Deliver program interrupts right away instead of queueing them ... Browse Code »

Doing so means that we don't have to save the flags anywhere and gets
rid of the last reference to to_book3s(vcpu) in arch/powerpc/kvm/book3s.c.

Doing so is OK because a program interrupt won't be generated at the
same time as any other synchronous interrupt. If a program interrupt
and an asynchronous interrupt (external or decrementer) are generated
at the same time, the program interrupt will be delivered, which is
correct because it has a higher priority, and then the asynchronous
interrupt will be masked.

We don't ever generate system reset or machine check interrupts to the
guest, but if we did, then we would need to make sure they got delivered
rather than the program interrupt. The current code would be wrong in
this situation anyway since it would deliver the program interrupt as
well as the reset/machine check interrupt.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:49 +0800
b01c8b54a powerpc, KVM: Rework KVM checks in first-level interrupt handlers ... Browse Code »

Instead of branching out-of-line with the DO_KVM macro to check if we
are in a KVM guest at the time of an interrupt, this moves the KVM
check inline in the first-level interrupt handlers. This speeds up
the non-KVM case and makes sure that none of the interrupt handlers
are missing the check.

Because the first-level interrupt handlers are now larger, some things
had to be move out of line in exceptions-64s.S.

This all necessitated some minor changes to the interrupt entry code
in KVM. This also streamlines the book3s_32 KVM test.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:48 +0800
f05ed4d56 KVM: PPC: Split out code from book3s.c into book3s_pr.c ... Browse Code »

In preparation for adding code to enable KVM to use hypervisor mode
on 64-bit Book 3S processors, this splits book3s.c into two files,
book3s.c and book3s_pr.c, where book3s_pr.c contains the code that is
specific to running the guest in problem state (user mode) and book3s.c
contains code which should apply to all Book 3S processors.

In doing this, we abstract some details, namely the interrupt offset,
updating the interrupt pending flag, and detecting if the guest is
in a critical section. These are all things that will be different
when we use hypervisor mode.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:47 +0800
c4befc58a KVM: PPC: Move fields between struct kvm_vcpu_arch and kvmppc_vcpu_book3s ... Browse Code »

This moves the slb field, which represents the state of the emulated
SLB, from the kvmppc_vcpu_book3s struct to the kvm_vcpu_arch, and the
hpte_hash_[v]pte[_long] fields from kvm_vcpu_arch to kvmppc_vcpu_book3s.
This is in accord with the principle that the kvm_vcpu_arch struct
represents the state of the emulated CPU, and the kvmppc_vcpu_book3s
struct holds the auxiliary data structures used in the emulation.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:46 +0800
149dbdb18 KVM: PPC: Fix machine checks on 32-bit Book3S ... Browse Code »

Commit 69acc0d3ba ("KVM: PPC: Resolve real-mode handlers through
function exports") resulted in vcpu->arch.trampoline_lowmem and
vcpu->arch.trampoline_enter ending up with kernel virtual addresses
rather than physical addresses. This is OK on 64-bit Book3S machines,
which ignore the top 4 bits of the effective address in real mode,
but on 32-bit Book3S machines, accessing these addresses in real mode
causes machine check interrupts, as the hardware uses the whole
effective address as the physical address in real mode.

This fixes the problem by using __pa() to convert these addresses
to physical addresses.

Signed-off-by: Paul Mackerras
Signed-off-by: Alexander Graf

Paul Mackerras
2011-07-12 18:16:45 +0800
3c8c652ae KVM: MMU: Introduce is_last_gpte() to clean up walk_addr_generic() ... Browse Code »

Suggested by Ingo and Avi.

Cc: Ingo Molnar
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Marcelo Tosatti

Takuya Yoshikawa
2011-07-12 18:16:44 +0800
92c1c1e85 KVM: MMU: Rename the walk label in walk_addr_generic() ... Browse Code »

The current name does not explain the meaning well. So give it a better
name "retry_walk" to show that we are trying the walk again.

This was suggested by Ingo Molnar.

Cc: Ingo Molnar
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Marcelo Tosatti

Takuya Yoshikawa
2011-07-12 18:16:43 +0800
134291bf3 KVM: MMU: Clean up the error handling of walk_addr_generic() ... Browse Code »

Avoid two step jump to the error handling part. This eliminates the use
of the variables present and rsvd_fault.

We also use the const type qualifier to show that write/user/fetch_fault
do not change in the function.

Both of these were suggested by Ingo Molnar.

Cc: Ingo Molnar
Signed-off-by: Takuya Yoshikawa
Signed-off-by: Marcelo Tosatti

Takuya Yoshikawa
2011-07-12 18:16:42 +0800
f8f7e5ee1 Revert "KVM: MMU: make kvm_mmu_reset_context() flush the guest TLB" ... Browse Code »

This reverts commit bee931d31e588b8eb86b7edee32fac2d16930cd7.

TLB flush should be done lazily during guest entry, in
kvm_mmu_load().

Signed-off-by: Marcelo Tosatti

Marcelo Tosatti
2011-07-12 18:16:41 +0800
1aee47a02 KVM: PPC: e500: Don't search over the entire TLB0. ... Browse Code »

Only look in the 4 entries that could possibly contain the
entry we're looking for.

Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Scott Wood
2011-07-12 18:16:40 +0800
dd9ebf1f9 KVM: PPC: e500: Add shadow PID support ... Browse Code »

Dynamically assign host PIDs to guest PIDs, splitting each guest PID into
multiple host (shadow) PIDs based on kernel/user and MSR[IS/DS]. Use
both PID0 and PID1 so that the shadow PIDs for the right mode can be
selected, that correspond both to guest TID = zero and guest TID = guest
PID.

This allows us to significantly reduce the frequency of needing to
invalidate the entire TLB. When the guest mode or PID changes, we just
update the host PID0/PID1. And since the allocation of shadow PIDs is
global, multiple guests can share the TLB without conflict.

Note that KVM does not yet support the guest setting PID1 or PID2 to
a value other than zero. This will need to be fixed for nested KVM
to work. Until then, we enforce the requirement for guest PID1/PID2
to stay zero by failing the emulation if the guest tries to set them
to something else.

Signed-off-by: Liu Yu
Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Liu Yu
2011-07-12 18:16:39 +0800
08b7fa92b KVM: PPC: e500: Stop keeping shadow TLB ... Browse Code »

Instead of a fully separate set of TLB entries, keep just the
pfn and dirty status.

Signed-off-by: Liu Yu
Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Liu Yu
2011-07-12 18:16:38 +0800
a4cd8b23a KVM: PPC: e500: enable magic page ... Browse Code »

This is a shared page used for paravirtualization. It is always present
in the guest kernel's effective address space at the address indicated
by the hypercall that enables it.

The physical address specified by the hypercall is not used, as
e500 does not have real mode.

Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Scott Wood
2011-07-12 18:16:37 +0800
9973d54ee KVM: PPC: e500: Support large page mappings of PFNMAP vmas. ... Browse Code »

This allows large pages to be used on guest mappings backed by things like
/dev/mem, resulting in a significant speedup when guest memory
is mapped this way (it's useful for directly-assigned MMIO, too).

This is not a substitute for hugetlbfs integration, but is useful for
configurations where devices are directly assigned on chips without an
IOMMU -- in these cases, we need guest physical and true physical to
match, and be contiguous, so static reservation and mapping via /dev/mem
is the most straightforward way to set things up.

Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Scott Wood
2011-07-12 18:16:36 +0800
59c1f4e35 KVM: PPC: e500: Eliminate shadow_pages[], and use pfns instead. ... Browse Code »

This is in line with what other architectures do, and will allow us to
map things other than ordinary, unreserved kernel pages -- such as
dedicated devices, or large contiguous reserved regions.

Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Scott Wood
2011-07-12 18:16:35 +0800
0ef309956 KVM: PPC: e500: don't use MAS0 as intermediate storage. ... Browse Code »

This avoids races. It also means that we use the shadow TLB way,
rather than the hardware hint -- if this is a problem, we could do
a tlbsx before inserting a TLB0 entry.

Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Scott Wood
2011-07-12 18:16:34 +0800
6fc4d1eb9 KVM: PPC: e500: Disable preloading TLB1 in tlb_load(). ... Browse Code »

Since TLB1 loading doesn't check the shadow TLB before allocating another
entry, you can get duplicates.

Once shadow PIDs are enabled in a later patch, we won't need to
invalidate the TLB on every switch, so this optimization won't be
needed anyway.

Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Scott Wood
2011-07-12 18:16:33 +0800
4cd35f675 KVM: PPC: e500: Save/restore SPE state ... Browse Code »

This is done lazily. The SPE save will be done only if the guest has
used SPE since the last preemption or heavyweight exit. Restore will be
done only on demand, when enabling MSR_SPE in the shadow MSR, in response
to an SPE fault or mtmsr emulation.

For SPEFSCR, Linux already switches it on context switch (non-lazily), so
the only remaining bit is to save it between qemu and the guest.

Signed-off-by: Liu Yu
Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Scott Wood
2011-07-12 18:16:32 +0800
ecee273fc KVM: PPC: booke: use shadow_msr ... Browse Code »

Keep the guest MSR and the guest-mode true MSR separate, rather than
modifying the guest MSR on each guest entry to produce a true MSR.

Any bits which should be modified based on guest MSR must be explicitly
propagated from vcpu->arch.shared->msr to vcpu->arch.shadow_msr in
kvmppc_set_msr().

While we're modifying the guest entry code, reorder a few instructions
to bury some load latencies.

Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Scott Wood
2011-07-12 18:16:32 +0800
c51584d52 powerpc/e500: SPE register saving: take arbitrary struct offset ... Browse Code »

Previously, these macros hardcoded THREAD_EVR0 as the base of the save
area, relative to the base register passed. This base offset is now
passed as a separate macro parameter, allowing reuse with other SPE
save areas, such as used by KVM.

Acked-by: Kumar Gala
Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Scott Wood
2011-07-12 18:16:31 +0800
685659ee7 powerpc/e500: Save SPEFCSR in flush_spe_to_thread() ... Browse Code »

giveup_spe() saves the SPE state which is protected by MSR[SPE].
However, modifying SPEFSCR does not trap when MSR[SPE]=0.
And since SPEFSCR is already saved/restored in _switch(),
not all the callers want to save SPEFSCR again.
Thus, saving SPEFSCR should not belong to giveup_spe().

This patch moves SPEFSCR saving to flush_spe_to_thread(),
and cleans up the caller that needs to save SPEFSCR accordingly.

Signed-off-by: Liu Yu
Acked-by: Kumar Gala
Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

yu liu
2011-07-12 18:16:30 +0800
a22a2dacc KVM: PPC: Resolve real-mode handlers through function exports ... Browse Code »

Up until now, Book3S KVM had variables stored in the kernel that a kernel module
or the kvm code in the kernel could read from to figure out where some real mode
helper functions are located.

This is all unnecessary. The high bits of the EA get ignore in real mode, so we
can just use the pointer as is. Also, it's a lot easier on relocations when we
use the normal way of resolving the address to a function, instead of jumping
through hoops.

This patch fixes compilation with CONFIG_RELOCATABLE=y.

Signed-off-by: Alexander Graf

Alexander Graf
2011-07-12 18:16:29 +0800
24294b9a3 KVM: PPC: fix partial application of "exit timing in ticks" ... Browse Code »

When http://www.spinics.net/lists/kvm-ppc/msg02664.html
was applied to produce commit b51e7aa7ed6d8d134d02df78300ab0f91cfff4d2,
the removal of the conversion in add_exit_timing was left out.

Signed-off-by: Stuart Yoder
Signed-off-by: Scott Wood
Signed-off-by: Alexander Graf

Stuart Yoder
2011-07-12 18:16:28 +0800
45bd07b9d KVM: MMU: make kvm_mmu_reset_context() flush the guest TLB ... Browse Code »

kvm_set_cr0() and kvm_set_cr4(), and possible other functions,
assume that kvm_mmu_reset_context() flushes the guest TLB. However,
it does not.

Fix by flushing the tlb (and syncing the new root as well).

Signed-off-by: Avi Kivity

Avi Kivity
2011-07-12 18:16:27 +0800