11 Jan, 2012

1 commit

  • * 'kvm-updates/3.3' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (74 commits)
    KVM: PPC: Whitespace fix for kvm.h
    KVM: Fix whitespace in kvm_para.h
    KVM: PPC: annotate kvm_rma_init as __init
    KVM: x86 emulator: implement RDPMC (0F 33)
    KVM: x86 emulator: fix RDPMC privilege check
    KVM: Expose the architectural performance monitoring CPUID leaf
    KVM: VMX: Intercept RDPMC
    KVM: SVM: Intercept RDPMC
    KVM: Add generic RDPMC support
    KVM: Expose a version 2 architectural PMU to a guests
    KVM: Expose kvm_lapic_local_deliver()
    KVM: x86 emulator: Use opcode::execute for Group 9 instruction
    KVM: x86 emulator: Use opcode::execute for Group 4/5 instructions
    KVM: x86 emulator: Use opcode::execute for Group 1A instruction
    KVM: ensure that debugfs entries have been created
    KVM: drop bsp_vcpu pointer from kvm struct
    KVM: x86: Consolidate PIT legacy test
    KVM: x86: Do not rely on implicit inclusions
    KVM: Make KVM_INTEL depend on CPU_SUP_INTEL
    KVM: Use memdup_user instead of kmalloc/copy_from_user
    ...

    Linus Torvalds
     

07 Jan, 2012

1 commit

  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (185 commits)
    powerpc: fix compile error with 85xx/p1010rdb.c
    powerpc: fix compile error with 85xx/p1023_rds.c
    powerpc/fsl: add MSI support for the Freescale hypervisor
    arch/powerpc/sysdev/fsl_rmu.c: introduce missing kfree
    powerpc/fsl: Add support for Integrated Flash Controller
    powerpc/fsl: update compatiable on fsl 16550 uart nodes
    powerpc/85xx: fix PCI and localbus properties in p1022ds.dts
    powerpc/85xx: re-enable ePAPR byte channel driver in corenet32_smp_defconfig
    powerpc/fsl: Update defconfigs to enable some standard FSL HW features
    powerpc: Add TBI PHY node to first MDIO bus
    sbc834x: put full compat string in board match check
    powerpc/fsl-pci: Allow 64-bit PCIe devices to DMA to any memory address
    powerpc: Fix unpaired probe_hcall_entry and probe_hcall_exit
    offb: Fix setting of the pseudo-palette for >8bpp
    offb: Add palette hack for qemu "standard vga" framebuffer
    offb: Fix bug in calculating requested vram size
    powerpc/boot: Change the WARN to INFO for boot wrapper overlap message
    powerpc/44x: Fix build error on currituck platform
    powerpc/boot: Change the load address for the wrapper to fit the kernel
    powerpc/44x: Enable CRASH_DUMP for 440x
    ...

    Fix up a trivial conflict in arch/powerpc/include/asm/cputime.h due to
    the additional sparse-checking code for cputime_t.

    Linus Torvalds
     

27 Dec, 2011

2 commits


26 Dec, 2011

3 commits

  • This is required for THIS_MODULE. We recently stopped acquiring
    it via some other header.

    Signed-off-by: Scott Wood
    Signed-off-by: Alexander Graf

    Scott Wood
     
  • Currently kvmppc_start_thread() tries to wake other SMT threads via
    xics_wake_cpu(). Unfortunately xics_wake_cpu only exists when
    CONFIG_SMP=Y so when compiling with CONFIG_SMP=N we get:

    arch/powerpc/kvm/built-in.o: In function `.kvmppc_start_thread':
    book3s_hv.c:(.text+0xa1e0): undefined reference to `.xics_wake_cpu'

    The following should be fine since kvmppc_start_thread() shouldn't be
    called to start non-zero threads when SMP=N since threads_per_core=1.

    Signed-off-by: Michael Neuling
    Signed-off-by: Alexander Graf

    Michael Neuling
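
A minimal sketch of the kind of guard this message describes, written as plain C rather than kernel code. The stub body of xics_wake_cpu(), the wake_count counter and the start_thread_sketch() name are assumptions made for illustration; CONFIG_SMP is modelled as an ordinary preprocessor macro and is left undefined here to mimic a UP build.

```c
/* Define this to model an SMP build; left undefined it models CONFIG_SMP=N. */
/* #define CONFIG_SMP 1 */

static int wake_count; /* hypothetical counter, used only to observe the stub */

#ifdef CONFIG_SMP
/* Stub standing in for xics_wake_cpu(), which only exists on SMP builds. */
static void xics_wake_cpu(int cpu)
{
    (void)cpu;
    wake_count++;
}
#endif

/*
 * Sketch: thread 0 of a core needs no wake-up. Secondary threads are
 * woken through xics_wake_cpu(), and that call is compiled out when
 * CONFIG_SMP is not set, so no undefined reference remains at link time.
 */
static void start_thread_sketch(int cpu, int ptid)
{
    (void)cpu; /* unused on UP builds */
    if (ptid != 0) {
#ifdef CONFIG_SMP
        xics_wake_cpu(cpu + ptid);
#endif
    }
}
```

With CONFIG_SMP undefined, start_thread_sketch() compiles and links without any reference to xics_wake_cpu(), which is the property the fix needs.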
     
  • kvmppc_h_pr is only available if CONFIG_KVM_BOOK3S_64_PR.

    Signed-off-by: Andreas Schwab
    Signed-off-by: Alexander Graf

    Andreas Schwab
     

08 Dec, 2011

1 commit

  • This fixes a problem where a CPU thread coming out of nap mode can
    think it has valid values in the nonvolatile GPRs (r14 - r31) as saved
    away in power7_idle, but in fact the values have been trashed because
    the thread was used for KVM in the meantime. The result is that the
    thread crashes because code that called power7_idle (e.g.,
    pnv_smp_cpu_kill_self()) goes to use values in registers that have
    been trashed.

    The bit field in SRR1 that tells whether state was lost only reflects
    the most recent nap, which may not have been the nap instruction in
    power7_idle. So we need an extra PACA field to indicate that state
    has been lost even if SRR1 indicates that the most recent nap didn't
    lose state. We clear this field when saving the state in power7_idle,
    we set it to a non-zero value when we use the thread for KVM, and we
    test it in power7_wakeup_noloss.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Paul Mackerras
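
The three rules in the second paragraph (clear on save, set on KVM use, test on wakeup) can be sketched as a small state model. The struct and function names below are hypothetical and the real code is assembly operating on the PACA; this only illustrates why the SRR1 bit alone is not enough.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the extra per-CPU (PACA) field. */
struct paca_sketch {
    unsigned char kvm_state_lost;
};

/* power7_idle path: the nonvolatile GPRs were just saved, so clear the flag. */
static void idle_save_state(struct paca_sketch *p)
{
    p->kvm_state_lost = 0;
}

/* KVM borrows the napping thread: whatever was in r14-r31 is now gone. */
static void kvm_use_thread(struct paca_sketch *p)
{
    p->kvm_state_lost = 1;
}

/*
 * Wakeup path: SRR1 only describes the most recent nap, so the
 * registers are trusted only if SRR1 reports no loss AND the flag
 * was never set since the state was saved.
 */
static bool gprs_still_valid(const struct paca_sketch *p, bool srr1_says_lost)
{
    return !srr1_says_lost && !p->kvm_state_lost;
}
```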
     

21 Nov, 2011

1 commit

  • * 'kvm-updates/3.2' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
    KVM guest: prevent tracing recursion with kvmclock
    Revert "KVM: PPC: Add support for explicit HIOR setting"
    KVM: VMX: Check for automatic switch msr table overflow
    KVM: VMX: Add support for guest/host-only profiling
    KVM: VMX: add support for switching of PERF_GLOBAL_CTRL
    KVM: s390: announce SYNC_MMU
    KVM: s390: Fix tprot locking
    KVM: s390: handle SIGP sense running intercepts
    KVM: s390: Fix RUNNING flag misinterpretation

    Linus Torvalds
     

17 Nov, 2011

1 commit


16 Nov, 2011

1 commit

  • If you build with KVM and UP it fails with the following due to a
    missing include.

    /arch/powerpc/kvm/book3s_hv.c: In function 'do_h_register_vpa':
    arch/powerpc/kvm/book3s_hv.c:156:10: error: 'H_PARAMETER' undeclared (first use in this function)
    arch/powerpc/kvm/book3s_hv.c:156:10: note: each undeclared identifier is reported only once for each function it appears in
    arch/powerpc/kvm/book3s_hv.c:192:12: error: 'H_RESOURCE' undeclared (first use in this function)
    arch/powerpc/kvm/book3s_hv.c:222:9: error: 'H_SUCCESS' undeclared (first use in this function)
    arch/powerpc/kvm/book3s_hv.c: In function 'kvmppc_pseries_do_hcall':
    arch/powerpc/kvm/book3s_hv.c:228:30: error: 'H_SUCCESS' undeclared (first use in this function)
    arch/powerpc/kvm/book3s_hv.c:232:7: error: 'H_CEDE' undeclared (first use in this function)
    arch/powerpc/kvm/book3s_hv.c:234:7: error: 'H_PROD' undeclared (first use in this function)
    arch/powerpc/kvm/book3s_hv.c:238:10: error: 'H_PARAMETER' undeclared (first use in this function)
    arch/powerpc/kvm/book3s_hv.c:250:7: error: 'H_CONFER' undeclared (first use in this function)
    arch/powerpc/kvm/book3s_hv.c:252:7: error: 'H_REGISTER_VPA' undeclared (first use in this function)
    make[2]: *** [arch/powerpc/kvm/book3s_hv.o] Error 1

    Signed-off-by: Michael Neuling
    cc: stable@kernel.org (3.1 only)
    Signed-off-by: Benjamin Herrenschmidt

    Michael Neuling
     

08 Nov, 2011

1 commit

  • Fix KVM build for older toolchains (found with .powerpc64-unknown-linux-gnu-gcc
    (crosstool-NG-1.8.1) 4.3.2):

    AS arch/powerpc/kvm/book3s_hv_rmhandlers.o
    arch/powerpc/kvm/book3s_hv_rmhandlers.S: Assembler messages:
    arch/powerpc/kvm/book3s_hv_rmhandlers.S:1388: Error: Unrecognized opcode: `popcntw'
    make[1]: *** [arch/powerpc/kvm/book3s_hv_rmhandlers.o] Error 1
    make: *** [_module_arch/powerpc/kvm] Error 2

    Signed-off-by: Nishanth Aravamudan
    Signed-off-by: Benjamin Herrenschmidt

    Nishanth Aravamudan
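
The message does not say how the build was fixed. One common way to support an assembler that lacks a mnemonic is to emit the raw instruction word instead, and the sketch below shows that idea under that assumption. The helper name encode_popcntw() is hypothetical; the base word uses primary opcode 31 and extended opcode 378, which is the X-form encoding of popcntw in Power ISA 2.06.

```c
#include <stdint.h>

/* popcntw RA,RS with both register fields zero:
 * (31 << 26) | (378 << 1) == 0x7c0002f4 */
#define INST_POPCNTW_BASE 0x7c0002f4u

/*
 * Build the 32-bit instruction word for "popcntw ra, rs".
 * RS occupies the 5-bit field at shift 21, RA the field at shift 16.
 */
static uint32_t encode_popcntw(unsigned ra, unsigned rs)
{
    return INST_POPCNTW_BASE | ((rs & 0x1fu) << 21) | ((ra & 0x1fu) << 16);
}
```

In assembly source the resulting word would be emitted with a .long directive, so the assembler never has to recognise the mnemonic.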
     

01 Nov, 2011

4 commits

  • None of the files touched here are modules, and they are not
    exporting any symbols either -- so there is no need to be including
    the module.h. Builds of all the files remain successful.

    Even kernel/module.c does not need to include it, since it includes
    linux/moduleloader.h instead.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • All these files were including module.h just for the basic
    EXPORT_SYMBOL infrastructure. We can shift them off to the
    export.h header which is a way smaller footprint and thus
    realize some compile time gains.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • Fix failures in powerpc associated with the previously allowed
    implicit module.h presence that now lead to things like this:

    arch/powerpc/mm/mmu_context_hash32.c:76:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
    arch/powerpc/mm/tlb_hash32.c:48:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
    arch/powerpc/kernel/pci_32.c:51:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL_GPL'
    arch/powerpc/kernel/iomap.c:36:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
    arch/powerpc/platforms/44x/canyonlands.c:126:1: error: type defaults to 'int' in declaration of 'EXPORT_SYMBOL'
    arch/powerpc/kvm/44x.c:168:59: error: 'THIS_MODULE' undeclared (first use in this function)

    [with several contributions from Stephen Rothwell]

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     
  • With module.h being implicitly everywhere via device.h, the absence
    of explicitly including something for EXPORT_SYMBOL went unnoticed.
    Since we are heading to fix things up and clean module.h from the
    device.h file, we need to explicitly include these files now.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

26 Sep, 2011

13 commits

  • With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
    core), whenever a CPU goes idle, we have to pull all the other
    hardware threads in the core out of the guest, because the H_CEDE
    hcall is handled in the kernel. This is inefficient.

    This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
    in real mode. When a guest vcpu does an H_CEDE hcall, we now only
    exit to the kernel if all the other vcpus in the same core are also
    idle. Otherwise we mark this vcpu as napping, save state that could
    be lost in nap mode (mainly GPRs and FPRs), and execute the nap
    instruction. When the thread wakes up, because of a decrementer or
    external interrupt, we come back in at kvm_start_guest (from the
    system reset interrupt vector), find the `napping' flag set in the
    paca, and go to the resume path.

    This has some other ramifications. First, when starting a core, we
    now start all the threads, both those that are immediately runnable and
    those that are idle. This is so that we don't have to pull all the
    threads out of the guest when an idle thread gets a decrementer interrupt
    and wants to start running. In fact the idle threads will all start
    with the H_CEDE hcall returning; being idle they will just do another
    H_CEDE immediately and go to nap mode.

    This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
    These functions have been restructured to make them simpler and clearer.
    We introduce a level of indirection in the wait queue that gets woken
    when external and decrementer interrupts get generated for a vcpu, so
    that we can have the 4 vcpus in a vcore using the same wait queue.
    We need this because the 4 vcpus are being handled by one thread.

    Secondly, when we need to exit from the guest to the kernel, we now
    have to generate an IPI for any napping threads, because an HDEC
    interrupt doesn't wake up a napping thread.

    Thirdly, we now need to be able to handle virtual external interrupts
    and decrementer interrupts becoming pending while a thread is napping,
    and deliver those interrupts to the guest when the thread wakes.
    This is done in kvmppc_cede_reentry, just before fast_guest_return.

    Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
    and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
    from kvm_arch_vcpu_runnable.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
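
The core decision described in the second paragraph can be sketched in C as follows. The real implementation is real-mode assembly in book3s_hv_rmhandlers.S; the enum, the idle[] array and the function name here are hypothetical and only model the rule "exit to the kernel only if every other vcpu in the core is also idle".

```c
#include <stdbool.h>

enum cede_action {
    CEDE_NAP_IN_GUEST,   /* mark vcpu napping, save GPRs/FPRs, execute nap */
    CEDE_EXIT_TO_KERNEL  /* whole core is idle, hand control back to the host */
};

/*
 * idle[i] is true when vcpu i of the core has already ceded.
 * self is the index of the vcpu issuing H_CEDE now.
 */
static enum cede_action h_cede_decide(const bool *idle, int nthreads, int self)
{
    for (int i = 0; i < nthreads; i++) {
        if (i != self && !idle[i])
            return CEDE_NAP_IN_GUEST; /* someone else is still running */
    }
    return CEDE_EXIT_TO_KERNEL;
}
```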
     
  • This simplifies the way that the book3s_pr makes the transition to
    real mode when entering the guest. We now call kvmppc_entry_trampoline
    (renamed from kvmppc_rmcall) in the base kernel using a normal function
    call instead of doing an indirect call through a pointer in the vcpu.
    If kvm is a module, the module loader takes care of generating a
    trampoline as it does for other calls to functions outside the module.

    kvmppc_entry_trampoline then disables interrupts and jumps to
    kvmppc_handler_trampoline_enter in real mode using an rfi[d].
    That then uses the link register as the address to return to
    (potentially in module space) when the guest exits.

    This also simplifies the way that we call the Linux interrupt handler
    when we exit the guest due to an external, decrementer or performance
    monitor interrupt. Instead of turning on the MMU, then deciding that
    we need to call the Linux handler and turning the MMU back off again,
    we now go straight to the handler at the point where we would turn the
    MMU on. The handler will then return to the virtual-mode code
    (potentially in the module).

    Along the way, this moves the setting and clearing of the HID5 DCBZ32
    bit into real-mode interrupts-off code, and also makes sure that
    we clear the MSR[RI] bit before loading values into SRR0/1.

    The net result is that we no longer need any code addresses to be
    stored in vcpu->arch.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This makes arch/powerpc/kvm/book3s_rmhandlers.S and
    arch/powerpc/kvm/book3s_hv_rmhandlers.S be assembled as
    separate compilation units rather than having them #included in
    arch/powerpc/kernel/exceptions-64s.S. We no longer have any
    conditional branches between the exception prologs in
    exceptions-64s.S and the KVM handlers, so there is no need to
    keep their contents close together in the vmlinux image.

    In their current location, they are using up part of the limited
    space between the first-level interrupt handlers and the firmware
    NMI data area at offset 0x7000, and with some kernel configurations
    this area will overflow (e.g. allyesconfig), leading to an
    "attempt to .org backwards" error when compiling exceptions-64s.S.

    Moving them out requires that we add some #includes that the
    book3s_{,hv_}rmhandlers.S code was previously getting implicitly
    via exceptions-64s.S.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • There are multiple features in PowerPC KVM that can now be enabled
    depending on the user's wishes. Some of the combinations don't make
    sense or don't work though.

    So this patch adds a way to check if the executing environment would
    actually be able to run the guest properly. It also adds sanity
    checks if PVR is set (should always be true given the current code
    flow), if PAPR is only used with book3s_64 where it works and that
    HV KVM is only used in PAPR mode.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • Now that Book3S PR mode can also run PAPR guests, we can add a PAPR cap and
    enable it for all Book3S targets. Enabling that CAP switches KVM into PAPR
    mode.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • PAPR defines hypercalls as SC1 instructions. Using these, the guest modifies
    page tables and does other privileged operations that it wouldn't be allowed
    to do in supervisor mode.

    This patch adds support for PR KVM to trap these instructions and route them
    through the same PAPR hypercall interface that we already use for HV style
    KVM.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • Recent Linux versions use the CFAR and PURR SPRs, but don't really care about
    their contents (yet). So for now, we can simply return 0 when the guest wants
    to read them.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • When running a PAPR guest, we need to handle a few hypercalls in kernel space,
    most prominently the page table invalidation (to sync the shadows).

    So this patch adds handling for a few PAPR hypercalls to PR mode KVM. I tried
    to share the code with HV mode, but it ended up being a lot easier this way
    around, as the two differ too much in those details.

    Signed-off-by: Alexander Graf

    ---

    v1 -> v2:

    - whitespace fix

    Alexander Graf
     
  • Until now, we always set HIOR based on the PVR, but this is just wrong.
    Instead, we should be setting HIOR explicitly, so user space can decide
    what the initial HIOR value is - just like on real hardware.

    We keep the old PVR based way around for backwards compatibility, but
    once user space uses the SREGS based method, we drop the PVR logic.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • We have a few traps where we cache the instruction that caused the trap
    for analysis later on. Since we now need to be able to distinguish
    between SC 0 and SC 1 system calls and the only way to find out which
    is which is by looking at the instruction, we also read out the instruction
    causing the system call.

    Signed-off-by: Alexander Graf

    Alexander Graf
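
A sketch of how the cached instruction word can tell the two system call forms apart. The macro and function names are hypothetical; the two constants are the Power ISA encodings of "sc" with LEV=0 and LEV=1, where the LEV field sits at shift 5.

```c
#include <stdbool.h>
#include <stdint.h>

#define INST_SC0 0x44000002u /* sc 0: ordinary system call */
#define INST_SC1 0x44000022u /* sc 1: hypercall form (LEV=1) */

/* True when the trapping instruction was "sc 1". */
static bool is_sc1(uint32_t last_inst)
{
    return last_inst == INST_SC1;
}

/* Extract the 7-bit LEV field from an sc instruction word. */
static unsigned sc_level(uint32_t inst)
{
    return (inst >> 5) & 0x7fu;
}
```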
     
  • When running a PAPR guest, the guest is not allowed to set SDR1 - instead
    the HTAB information is held in internal hypervisor structures. But all of
    our current code relies on SDR1 and walking the HTAB like on real hardware.

    So in order to not be too intrusive, we simply set SDR1 to the HTAB we hold
    in host memory. That way we can keep the HTAB in user space, but use it from
    kernel space to map the guest.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • We have 3 privilege levels: problem state, supervisor state and hypervisor
    state. Each of them can access different SPRs, so we need to check on every
    SPR if it's accessible in the respective mode.

    Signed-off-by: Alexander Graf

    Alexander Graf
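
One way to picture the per-SPR check is a table that records the minimum privilege level each SPR requires. This is a sketch only: the table layout and names are hypothetical, the three entries (CTR, SRR0, HSRR0) are merely illustrative, and denying SPRs that are not listed is an assumption of this example rather than something the message states.

```c
#include <stdbool.h>
#include <stddef.h>

enum priv_level { PRIV_PROBLEM = 0, PRIV_SUPERVISOR = 1, PRIV_HYPERVISOR = 2 };

struct spr_rule {
    int sprn;                  /* SPR number */
    enum priv_level min_level; /* lowest level allowed to access it */
};

/* Illustrative entries only; a real table would cover every emulated SPR. */
static const struct spr_rule spr_rules[] = {
    { 9,   PRIV_PROBLEM },     /* CTR */
    { 26,  PRIV_SUPERVISOR },  /* SRR0 */
    { 314, PRIV_HYPERVISOR },  /* HSRR0 */
};

/* Return true if code running at level cur may access SPR sprn. */
static bool spr_allowed(int sprn, enum priv_level cur)
{
    for (size_t i = 0; i < sizeof(spr_rules) / sizeof(spr_rules[0]); i++) {
        if (spr_rules[i].sprn == sprn)
            return cur >= spr_rules[i].min_level;
    }
    return false; /* unknown SPR: deny (assumption of this sketch) */
}
```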
     
  • We need the compute_tlbie_rb in _pr and _hv implementations for papr
    soon, so let's move it over to a common header file that both
    implementations can leverage.

    Signed-off-by: Alexander Graf

    Alexander Graf
     

05 Aug, 2011

1 commit


25 Jul, 2011

1 commit

  • * 'kvm-updates/3.1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (143 commits)
    KVM: IOMMU: Disable device assignment without interrupt remapping
    KVM: MMU: trace mmio page fault
    KVM: MMU: mmio page fault support
    KVM: MMU: reorganize struct kvm_shadow_walk_iterator
    KVM: MMU: lockless walking shadow page table
    KVM: MMU: do not need atomicly to set/clear spte
    KVM: MMU: introduce the rules to modify shadow page table
    KVM: MMU: abstract some functions to handle fault pfn
    KVM: MMU: filter out the mmio pfn from the fault pfn
    KVM: MMU: remove bypass_guest_pf
    KVM: MMU: split kvm_mmu_free_page
    KVM: MMU: count used shadow pages on prepareing path
    KVM: MMU: rename 'pt_write' to 'emulate'
    KVM: MMU: cleanup for FNAME(fetch)
    KVM: MMU: optimize to handle dirty bit
    KVM: MMU: cache mmio info on page fault path
    KVM: x86: introduce vcpu_mmio_gva_to_gpa to cleanup the code
    KVM: MMU: do not update slot bitmap if spte is nonpresent
    KVM: MMU: fix walking shadow page table
    KVM guest: KVM Steal time registration
    ...

    Linus Torvalds
     

23 Jul, 2011

1 commit

  • virtio has been so far used only in the context of virtualization,
    and the virtio Kconfig was sourced directly by the relevant arch
    Kconfigs when VIRTUALIZATION was selected.

    Now that we start using virtio for inter-processor communications,
    we need to source the virtio Kconfig outside of the virtualization
    scope too.

    Moreover, some architectures might use virtio for both virtualization
    and inter-processor communications, so directly sourcing virtio
    might yield unexpected results due to conflicting selections.

    The simple solution offered by this patch is to always source virtio's
    Kconfig in drivers/Kconfig, and remove it from the appropriate arch
    Kconfigs. Additionally, a virtio menu entry has been added so virtio
    drivers don't show up in the general drivers menu.

    This way anyone can use virtio, though it's arguably less accessible
    (and neat!) for virtualization users now.

    Note: some architectures (mips and sh) seem to have a VIRTUALIZATION
    menu merely for sourcing virtio's Kconfig, so that menu is removed too.

    Signed-off-by: Ohad Ben-Cohen
    Signed-off-by: Rusty Russell

    Ohad Ben-Cohen
     

12 Jul, 2011

8 commits

  • This adds support for running KVM guests in supervisor mode on those
    PPC970 processors that have a usable hypervisor mode. Unfortunately,
    Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
    1), but the YDL PowerStation does have a usable hypervisor mode.

    There are several differences between the PPC970 and POWER7 in how
    guests are managed. These differences are accommodated using the
    CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
    bits. Notably, on PPC970:

    * The LPCR, LPID or RMOR registers don't exist, and the functions of
    those registers are provided by bits in HID4 and one bit in HID0.

    * External interrupts can be directed to the hypervisor, but unlike
    POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
    SRR0/1 not HSRR0/1.

    * There is no virtual RMA (VRMA) mode; the guest must use an RMO
    (real mode offset) area.

    * The TLB entries are not tagged with the LPID, so it is necessary to
    flush the whole TLB on partition switch. Furthermore, when switching
    partitions we have to ensure that no other CPU is executing the tlbie
    or tlbsync instructions in either the old or the new partition,
    otherwise undefined behaviour can occur.

    * The PMU has 8 counters (PMC registers) rather than 6.

    * The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.

    * The SLB has 64 entries rather than 32.

    * There is no mediated external interrupt facility, so if we switch to
    a guest that has a virtual external interrupt pending but the guest
    has MSR[EE] = 0, we have to arrange to have an interrupt pending for
    it so that we can get control back once it re-enables interrupts. We
    do that by sending ourselves an IPI with smp_send_reschedule after
    hard-disabling interrupts.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This replaces the single CPU_FTR_HVMODE_206 bit with two bits, one to
    indicate that we have a usable hypervisor mode, and another to indicate
    that the processor conforms to PowerISA version 2.06. We also add
    another bit to indicate that the processor conforms to ISA version 2.01
    and set that for PPC970 and derivatives.

    Some PPC970 chips (specifically those in Apple machines) have a
    hypervisor mode in that MSR[HV] is always 1, but the hypervisor mode
    is not useful in the sense that there is no way to run any code in
    supervisor mode (HV=0 PR=0). On these processors, the LPES0 and LPES1
    bits in HID4 are always 0, and we use that as a way of detecting that
    hypervisor mode is not useful.

    Where we have a feature section in assembly code around code that
    only applies on POWER7 in hypervisor mode, we use a construct like

    END_FTR_SECTION_IFSET(CPU_FTR_HVMODE | CPU_FTR_ARCH_206)

    The definition of END_FTR_SECTION_IFSET is such that the code will
    be enabled (not overwritten with nops) only if all bits in the
    provided mask are set.

    Note that the CPU feature check in __tlbie() only needs to check the
    ARCH_206 bit, not the HVMODE bit, because __tlbie() can only get called
    if we are running bare-metal, i.e. in hypervisor mode.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
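
The "all bits in the mask must be set" rule for END_FTR_SECTION_IFSET can be modelled in a few lines of C. The bit positions below are hypothetical placeholders, not the kernel's real CPU_FTR_* values, and the function is a model of the test, not the feature-fixup code itself.

```c
#include <stdbool.h>

/* Hypothetical bit assignments, for illustration only. */
#define FTR_HVMODE   (1ul << 0)
#define FTR_ARCH_206 (1ul << 1)
#define FTR_ARCH_201 (1ul << 2)

/*
 * Model of the IFSET test: the section stays enabled (is not
 * overwritten with nops) only when every bit of mask is present
 * in the CPU's feature word.
 */
static bool ftr_section_enabled(unsigned long cpu_features, unsigned long mask)
{
    return (cpu_features & mask) == mask;
}
```

Under this model a POWER7 running as a guest (ARCH_206 set, HVMODE clear) and a PPC970 with a usable hypervisor mode (HVMODE set, ARCH_206 clear) both fail the combined HVMODE | ARCH_206 test.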
     
  • This adds infrastructure which will be needed to allow book3s_hv KVM to
    run on older POWER processors, including PPC970, which don't support
    the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
    Offset (RMO) facility. These processors require a physically
    contiguous, aligned area of memory for each guest. When the guest does
    an access in real mode (MMU off), the address is compared against a
    limit value, and if it is lower, the address is ORed with an offset
    value (from the Real Mode Offset Register (RMOR)) and the result becomes
    the real address for the access. The size of the RMA has to be one of
    a set of supported values, which usually includes 64MB, 128MB, 256MB
    and some larger powers of 2.

    Since we are unlikely to be able to allocate 64MB or more of physically
    contiguous memory after the kernel has been running for a while, we
    allocate a pool of RMAs at boot time using the bootmem allocator. The
    size and number of the RMAs can be set using the kvm_rma_size=xx and
    kvm_rma_count=xx kernel command line options.

    KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
    of the pool of preallocated RMAs. The capability value is 1 if the
    processor can use an RMA but doesn't require one (because it supports
    the VRMA facility), or 2 if the processor requires an RMA for each guest.

    This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
    pool and returns a file descriptor which can be used to map the RMA. It
    also returns the size of the RMA in the argument structure.

    Having an RMA means we will get multiple KVM_SET_USER_MEMORY_REGION
    ioctl calls from userspace. To cope with this, we now preallocate the
    kvm->arch.ram_pginfo array when the VM is created with a size sufficient
    for up to 64GB of guest memory. Subsequently we will get rid of this
    array and use memory associated with each memslot instead.

    This moves most of the code that translates the user addresses into
    host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
    to kvmppc_core_prepare_memory_region. Also, instead of having to look
    up the VMA for each page in order to check the page size, we now check
    that the pages we get are compound pages of 16MB. However, if we are
    adding memory that is mapped to an RMA, we don't bother with calling
    get_user_pages_fast and instead just offset from the base pfn for the
    RMA.

    Typically the RMA gets added after vcpus are created, which makes it
    inconvenient to have the LPCR (logical partition control register) value
    in the vcpu->arch struct, since the LPCR controls whether the processor
    uses RMA or VRMA for the guest. This moves the LPCR value into the
    kvm->arch struct and arranges for the MER (mediated external request)
    bit, which is the only bit that varies between vcpus, to be set in
    assembly code when going into the guest if there is a pending external
    interrupt request.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
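
The real-mode address translation described in the first paragraph (compare against a limit, then OR in the offset) can be sketched as below. The function name and the use of a boolean return for "outside the RMA" are assumptions of this example. The sketch also assumes the RMA base is aligned to at least the RMA size, in which case the OR gives the same result as an addition.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Translate a guest real-mode address.
 *   ea    - address issued by the guest with the MMU off
 *   limit - size of the RMA (e.g. 64MB)
 *   rmor  - offset value, i.e. the physical base of the RMA
 * Returns false when ea is not below the limit; otherwise stores the
 * resulting real address in *ra.
 */
static bool rmo_translate(uint64_t ea, uint64_t limit, uint64_t rmor, uint64_t *ra)
{
    if (ea >= limit)
        return false;
    *ra = ea | rmor;
    return true;
}
```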
     
  • This lifts the restriction that book3s_hv guests can only run one
    hardware thread per core, and allows them to use up to 4 threads
    per core on POWER7. The host still has to run single-threaded.

    This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
    capability. The return value of the ioctl querying this capability
    is the number of vcpus per virtual CPU core (vcore), currently 4.

    To use this, the host kernel should be booted with all threads
    active, and then all the secondary threads should be offlined.
    This will put the secondary threads into nap mode. KVM will then
    wake them from nap mode and use them for running guest code (while
    they are still offline). To wake the secondary threads, we send
    them an IPI using a new xics_wake_cpu() function, implemented in
    arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
    we assume that the platform has a XICS interrupt controller and
    we are using icp-native.c to drive it. Since the woken thread will
    need to acknowledge and clear the IPI, we also export the base
    physical address of the XICS registers using kvmppc_set_xics_phys()
    for use in the low-level KVM book3s code.

    When a vcpu is created, it is assigned to a virtual CPU core.
    The vcore number is obtained by dividing the vcpu number by the
    number of threads per core in the host. This number is exported
    to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
    to run the guest in single-threaded mode, it should make all vcpu
    numbers be multiples of the number of threads per core.

    We distinguish three states of a vcpu: runnable (i.e., ready to execute
    the guest), blocked (that is, idle), and busy in host. We currently
    implement a policy that the vcore can run only when all its threads
    are runnable or blocked. This way, if a vcpu needs to execute elsewhere
    in the kernel or in qemu, it can do so without being starved of CPU
    by the other vcpus.

    When a vcore starts to run, it executes in the context of one of the
    vcpu threads. The other vcpu threads all go to sleep and stay asleep
    until something happens requiring the vcpu thread to return to qemu,
    or to wake up to run the vcore (this can happen when another vcpu
    thread goes from busy in host state to blocked).

    It can happen that a vcpu goes from blocked to runnable state (e.g.
    because of an interrupt), and the vcore it belongs to is already
    running. In that case it can start to run immediately as long as
    none of the vcpus in the vcore have started to exit the guest.
    We send the next free thread in the vcore an IPI to get it to start
    to execute the guest. It synchronizes with the other threads via
    the vcore->entry_exit_count field to make sure that it doesn't go
    into the guest if the other vcpus are exiting by the time that it
    is ready to actually enter the guest.

    Note that there is no fixed relationship between the hardware thread
    number and the vcpu number. Hardware threads are assigned to vcpus
    as they become runnable, so we will always use the lower-numbered
    hardware threads in preference to higher-numbered threads if not all
    the vcpus in the vcore are runnable, regardless of which vcpus are
    runnable.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
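
The vcpu-to-vcore mapping in the fourth paragraph is a plain integer division. A small sketch, with hypothetical function names:

```c
/* vcore number = vcpu number / threads per core in the host. */
static int vcore_of_vcpu(int vcpu_id, int threads_per_core)
{
    return vcpu_id / threads_per_core;
}

/*
 * For a single-threaded guest, userspace picks vcpu numbers that are
 * multiples of threads_per_core so that each vcpu gets its own vcore.
 */
static int single_threaded_vcpu_id(int n, int threads_per_core)
{
    return n * threads_per_core;
}
```

With 4 threads per core, vcpus 0 to 3 share vcore 0, while the single-threaded numbering 0, 4, 8 places each vcpu in a different vcore.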
     
  • This improves I/O performance for guests using the PAPR
    paravirtualization interface by making the H_PUT_TCE hcall faster, by
    implementing it in real mode. H_PUT_TCE is used for updating virtual
    IOMMU tables, and is used both for virtual I/O and for real I/O in the
    PAPR interface.

    Since this moves the IOMMU tables into the kernel, we define a new
    KVM_CREATE_SPAPR_TCE ioctl to allow qemu to create the tables. The
    ioctl returns a file descriptor which can be used to mmap the newly
    created table. The qemu driver models use them in the same way as
    userspace managed tables, but they can be updated directly by the
    guest with a real-mode H_PUT_TCE implementation, reducing the number
    of host/guest context switches during guest IO.

    There are certain circumstances where it is useful for userland qemu
    to write to the TCE table even if the kernel H_PUT_TCE path is used
    most of the time. Specifically, allowing this will avoid awkwardness
    when we need to reset the table. More importantly, we will in the
    future need to write the table in order to restore its state after a
    checkpoint resume or migration.

    Signed-off-by: David Gibson
    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    David Gibson
     
  • This adds the infrastructure for handling PAPR hcalls in the kernel,
    either early in the guest exit path while we are still in real mode,
    or later once the MMU has been turned back on and we are in the full
    kernel context. The advantage of handling hcalls in real mode if
    possible is that we avoid two partition switches -- and this will
    become more important when we support SMT4 guests, since a partition
    switch means we have to pull all of the threads in the core out of
    the guest. The disadvantage is that we can only access the kernel
    linear mapping, not anything vmalloced or ioremapped, since the MMU
    is off.

    This also adds code to handle the following hcalls in real mode:

    H_ENTER Add an HPTE to the hashed page table
    H_REMOVE Remove an HPTE from the hashed page table
    H_READ Read HPTEs from the hashed page table
    H_PROTECT Change the protection bits in an HPTE
    H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
    H_SET_DABR Set the data address breakpoint register

    Plus code to handle the following hcalls in the kernel:

    H_CEDE Idle the vcpu until an interrupt or H_PROD hcall arrives
    H_PROD Wake up a ceded vcpu
    H_REGISTER_VPA Register a virtual processor area (VPA)

    The code that runs in real mode has to be in the base kernel, not in
    the module, if KVM is compiled as a module. The real-mode code can
    only access the kernel linear mapping, not vmalloc or ioremap space.
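
    The split between the two handling stages can be pictured as a lookup
    from hcall to handling tier. The enum values below are symbolic
    stand-ins, not the PAPR hcall numbers, and treating every other
    hcall as a userspace fallback is an assumption of this sketch.

```c
#include <assert.h>

/* Symbolic stand-ins for the hcalls listed above (not PAPR numbers). */
enum hcall {
    HC_ENTER, HC_REMOVE, HC_READ, HC_PROTECT, HC_BULK_REMOVE, HC_SET_DABR,
    HC_CEDE, HC_PROD, HC_REGISTER_VPA,
    HC_COUNT
};

enum hcall_tier { TIER_REAL_MODE, TIER_KERNEL, TIER_USERSPACE };

static const enum hcall_tier tier_table[] = {
    [HC_ENTER]        = TIER_REAL_MODE,
    [HC_REMOVE]       = TIER_REAL_MODE,
    [HC_READ]         = TIER_REAL_MODE,
    [HC_PROTECT]      = TIER_REAL_MODE,
    [HC_BULK_REMOVE]  = TIER_REAL_MODE,
    [HC_SET_DABR]     = TIER_REAL_MODE,
    [HC_CEDE]         = TIER_KERNEL,
    [HC_PROD]         = TIER_KERNEL,
    [HC_REGISTER_VPA] = TIER_KERNEL,
};

static_assert(sizeof(tier_table) / sizeof(tier_table[0]) == HC_COUNT,
              "every modelled hcall needs a tier");

/* Anything outside the modelled set is assumed to go to userspace. */
enum hcall_tier hcall_tier_for(int h)
{
    if (h < 0 || h >= HC_COUNT)
        return TIER_USERSPACE;
    return tier_table[h];
}
```

    A table keeps the real-mode path to a bounds check and one load,
    which suits code that cannot touch vmalloc or ioremap space.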

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This adds support for KVM running on 64-bit Book 3S processors,
    specifically POWER7, in hypervisor mode. Using hypervisor mode means
    that the guest can use the processor's supervisor mode, so it can
    execute privileged instructions and access privileged registers
    itself without trapping to the host. This gives excellent
    performance, but does mean that KVM cannot emulate a processor
    architecture other than the one that the hardware implements.

    This code assumes that the guest is running paravirtualized using the
    PAPR (Power Architecture Platform Requirements) interface, which is the
    interface that IBM's PowerVM hypervisor uses. That means that existing
    Linux distributions that run on IBM pSeries machines will also run
    under KVM without modification. In order to communicate the PAPR
    hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
    to include/linux/kvm.h.

    Currently the choice between book3s_hv support and book3s_pr support
    (i.e. the existing code, which runs the guest in user mode) has to be
    made at kernel configuration time, so a given kernel binary can only
    do one or the other.

    This new book3s_hv code doesn't support MMIO emulation at present.
    Since we are running paravirtualized guests, this isn't a serious
    restriction.

    With the guest running in supervisor mode, most exceptions go straight
    to the guest. We will never get data or instruction storage or segment
    interrupts, alignment interrupts, decrementer interrupts, program
    interrupts, single-step interrupts, etc., coming to the hypervisor from
    the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
    exception entry path so that we don't have to do the KVM test on entry
    to those exception handlers.

    We do however get hypervisor decrementer, hypervisor data storage,
    hypervisor instruction storage, and hypervisor emulation assist
    interrupts, so we have to handle those.

    In hypervisor mode, real-mode accesses can access all of RAM, not just
    a limited amount. Therefore we put all the guest state in the vcpu.arch
    and use the shadow_vcpu in the PACA only for temporary scratch space.
    We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
    anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
    We don't have a shared page with the guest, but we still need a
    kvm_vcpu_arch_shared struct to store the values of various registers,
    so we include one in the vcpu_arch struct.

    The POWER7 processor has a restriction that all threads in a core have
    to be in the same partition. MMU-on kernel code counts as a partition
    (partition 0), so we have to do a partition switch on every entry to and
    exit from the guest. At present we require the host and guest to run
    in single-thread mode because of this hardware restriction.

    This code allocates a hashed page table for the guest and initializes
    it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
    require that the guest memory is allocated using 16MB huge pages, in
    order to simplify the low-level memory management. This also means that
    we can get away without tracking paging activity in the host for now,
    since huge pages can't be paged or swapped.
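
    To give a feel for the arithmetic, this sketch counts the 16MB pages
    needed to cover a VRMA of a given size. vrma_hpte_count() is a
    hypothetical helper, and assuming one HPTE per 16MB page is a
    simplification for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define HUGE_PAGE_SHIFT 24
#define HUGE_PAGE_SIZE  (UINT64_C(1) << HUGE_PAGE_SHIFT)

static_assert(HUGE_PAGE_SIZE == UINT64_C(16) * 1024 * 1024,
              "huge pages are 16MB");

/*
 * Hypothetical helper: number of 16MB pages (and, in this model, HPTEs)
 * needed to map vrma_bytes. Returns 0 if the size is not a whole number
 * of huge pages.
 */
uint64_t vrma_hpte_count(uint64_t vrma_bytes)
{
    if (vrma_bytes & (HUGE_PAGE_SIZE - 1))
        return 0;
    return vrma_bytes >> HUGE_PAGE_SHIFT;
}
```

    For example, a 256MB VRMA needs 16 entries in this model, while a
    size that is not a multiple of 16MB is rejected.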

    This also adds a few new exports needed by the book3s_hv code.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • There are several fields in struct kvmppc_book3s_shadow_vcpu that
    temporarily store bits of host state while a guest is running,
    rather than anything relating to the particular guest or vcpu.
    This splits them out into a new kvmppc_host_state structure and
    modifies the definitions in asm-offsets.c to suit.

    On 32-bit, we have a kvmppc_host_state structure inside the
    kvmppc_book3s_shadow_vcpu since the assembly code needs to be able
    to get to them both with one pointer. On 64-bit they are separate
    fields in the PACA. This means that on 64-bit we don't need to
    copy the kvmppc_host_state in and out on vcpu load/unload, and
    in future will mean that the book3s_hv code doesn't need a
    shadow_vcpu struct in the PACA at all. That does mean that we
    have to be careful not to rely on any values persisting in the
    hstate field of the paca across any point where we could block
    or get preempted.
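
    The two layouts can be sketched with simplified structs. All type and
    field names here are hypothetical stand-ins with made-up members;
    only the nesting mirrors the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-only state saved while a guest runs. */
struct model_host_state {
    unsigned long host_r1;
    unsigned long host_r2;
};

/* 32-bit model: host state is embedded, so one pointer reaches both. */
struct model_shadow_vcpu_32 {
    unsigned long guest_scratch;
    struct model_host_state hstate;
};

/* 64-bit model: shadow vcpu and host state are separate PACA fields. */
struct model_shadow_vcpu_64 {
    unsigned long guest_scratch;
};

struct model_paca_64 {
    struct model_shadow_vcpu_64 shadow_vcpu;
    struct model_host_state hstate;
};

/* On the 32-bit layout, assembly-style code derives hstate from svcpu. */
struct model_host_state *hstate_from_svcpu32(struct model_shadow_vcpu_32 *svcpu)
{
    assert(svcpu);
    return &svcpu->hstate;
}
```

    In the 64-bit model the shadow vcpu no longer carries host state, so
    copying it on vcpu load/unload does not copy the host fields.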

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras