08 Dec, 2011

1 commit

  • This fixes a problem where a CPU thread coming out of nap mode can
    think it has valid values in the nonvolatile GPRs (r14 - r31) as saved
    away in power7_idle, but in fact the values have been trashed because
the thread was used for KVM in the meantime. The result is that the
    thread crashes because code that called power7_idle (e.g.,
    pnv_smp_cpu_kill_self()) goes to use values in registers that have
    been trashed.

    The bit field in SRR1 that tells whether state was lost only reflects
    the most recent nap, which may not have been the nap instruction in
    power7_idle. So we need an extra PACA field to indicate that state
    has been lost even if SRR1 indicates that the most recent nap didn't
    lose state. We clear this field when saving the state in power7_idle,
    we set it to a non-zero value when we use the thread for KVM, and we
    test it in power7_wakeup_noloss.
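
    As a rough illustration of the scheme (names here are the editor's, not
    the kernel's actual symbols), the extra PACA field works like this:

    /* Hedged sketch -- illustrative names, not the real paca layout. */
    struct paca_nap_sketch {
        int nap_state_lost;   /* non-zero: saved r14-r31 are no longer valid */
    };

    /* power7_idle path: nonvolatile GPRs have just been saved away. */
    static void nap_state_saved(struct paca_nap_sketch *paca)
    {
        paca->nap_state_lost = 0;
    }

    /* The thread is borrowed for KVM: the saved registers become stale. */
    static void thread_used_for_kvm(struct paca_nap_sketch *paca)
    {
        paca->nap_state_lost = 1;
    }

    /* power7_wakeup_noloss: believe SRR1 only if the flag is still clear. */
    static int state_really_kept(struct paca_nap_sketch *paca, int srr1_no_loss)
    {
        return srr1_no_loss && !paca->nap_state_lost;
    }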

    Signed-off-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Paul Mackerras
     

07 Nov, 2011

1 commit

  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (106 commits)
    powerpc/p3060qds: Add support for P3060QDS board
    powerpc/83xx: Add shutdown request support to MCU handling on MPC8349 MITX
    powerpc/85xx: Make kexec to interate over online cpus
    powerpc/fsl_booke: Fix comment in head_fsl_booke.S
    powerpc/85xx: issue 15 EOI after core reset for FSL CoreNet devices
    powerpc/8xxx: Fix interrupt handling in MPC8xxx GPIO driver
    powerpc/85xx: Add 'fsl,pq3-gpio' compatiable for GPIO driver
    powerpc/86xx: Correct Gianfar support for GE boards
    powerpc/cpm: Clear muram before it is in use.
    drivers/virt: add ioctl for 32-bit compat on 64-bit to fsl-hv-manager
    powerpc/fsl_msi: add support for "msi-address-64" property
    powerpc/85xx: Setup secondary cores PIR with hard SMP id
    powerpc/fsl-booke: Fix settlbcam for 64-bit
    powerpc/85xx: Adding DCSR node to dtsi device trees
    powerpc/85xx: clean up FPGA device tree nodes for Freecsale QorIQ boards
    powerpc/85xx: fix PHYS_64BIT selection for P1022DS
    powerpc/fsl-booke: Fix setup_initial_memory_limit to not blindly map
    powerpc: respect mem= setting for early memory limit setup
    powerpc: Update corenet64_smp_defconfig
    powerpc: Update mpc85xx/corenet 32-bit defconfigs
    ...

    Fix up trivial conflicts in:
    - arch/powerpc/configs/40x/hcu4_defconfig
    removed stale file, edited elsewhere
    - arch/powerpc/include/asm/udbg.h, arch/powerpc/kernel/udbg.c:
    added opal and gelic drivers vs added ePAPR driver
    - drivers/tty/serial/8250.c
    moved UPIO_TSI to powerpc vs removed UPIO_DWAPB support

    Linus Torvalds
     

26 Sep, 2011

2 commits

  • With a KVM guest operating in SMT4 mode (i.e. 4 hardware threads per
    core), whenever a CPU goes idle, we have to pull all the other
    hardware threads in the core out of the guest, because the H_CEDE
    hcall is handled in the kernel. This is inefficient.

    This adds code to book3s_hv_rmhandlers.S to handle the H_CEDE hcall
    in real mode. When a guest vcpu does an H_CEDE hcall, we now only
    exit to the kernel if all the other vcpus in the same core are also
    idle. Otherwise we mark this vcpu as napping, save state that could
    be lost in nap mode (mainly GPRs and FPRs), and execute the nap
    instruction. When the thread wakes up, because of a decrementer or
    external interrupt, we come back in at kvm_start_guest (from the
    system reset interrupt vector), find the `napping' flag set in the
    paca, and go to the resume path.

    This has some other ramifications. First, when starting a core, we
    now start all the threads, both those that are immediately runnable and
    those that are idle. This is so that we don't have to pull all the
    threads out of the guest when an idle thread gets a decrementer interrupt
    and wants to start running. In fact the idle threads will all start
    with the H_CEDE hcall returning; being idle they will just do another
    H_CEDE immediately and go to nap mode.

    This required some changes to kvmppc_run_core() and kvmppc_run_vcpu().
    These functions have been restructured to make them simpler and clearer.
    We introduce a level of indirection in the wait queue that gets woken
    when external and decrementer interrupts get generated for a vcpu, so
    that we can have the 4 vcpus in a vcore using the same wait queue.
    We need this because the 4 vcpus are being handled by one thread.

    Secondly, when we need to exit from the guest to the kernel, we now
    have to generate an IPI for any napping threads, because an HDEC
    interrupt doesn't wake up a napping thread.

    Thirdly, we now need to be able to handle virtual external interrupts
    and decrementer interrupts becoming pending while a thread is napping,
    and deliver those interrupts to the guest when the thread wakes.
    This is done in kvmppc_cede_reentry, just before fast_guest_return.

    Finally, since we are not using the generic kvm_vcpu_block for book3s_hv,
    and hence not calling kvm_arch_vcpu_runnable, we can remove the #ifdef
    from kvm_arch_vcpu_runnable.
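
    A hedged C sketch of the real-mode H_CEDE decision described above (the
    actual logic lives in assembly in book3s_hv_rmhandlers.S, and all names
    here are illustrative):

    struct cede_vcpu { int ceded; int napping; };
    struct cede_vcore { int nr_threads; struct cede_vcpu *vcpu[4]; };

    static int all_others_idle(struct cede_vcore *vc, struct cede_vcpu *me)
    {
        int i;

        for (i = 0; i < vc->nr_threads; i++)
            if (vc->vcpu[i] != me && !vc->vcpu[i]->ceded)
                return 0;
        return 1;
    }

    /* Returns non-zero if the whole core must exit to the kernel; otherwise
     * the thread saves its nap-volatile state (GPRs, FPRs), naps, and later
     * resumes in kvm_start_guest, which finds the napping flag set. */
    static int do_cede(struct cede_vcore *vc, struct cede_vcpu *me)
    {
        me->ceded = 1;
        if (all_others_idle(vc, me))
            return 1;
        me->napping = 1;
        /* ... save state, execute the nap instruction ... */
        return 0;
    }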

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This simplifies the way that the book3s_pr code makes the transition to
    real mode when entering the guest. We now call kvmppc_entry_trampoline
    (renamed from kvmppc_rmcall) in the base kernel using a normal function
    call instead of doing an indirect call through a pointer in the vcpu.
    If kvm is a module, the module loader takes care of generating a
    trampoline as it does for other calls to functions outside the module.

    kvmppc_entry_trampoline then disables interrupts and jumps to
    kvmppc_handler_trampoline_enter in real mode using an rfi[d].
    That then uses the link register as the address to return to
    (potentially in module space) when the guest exits.

    This also simplifies the way that we call the Linux interrupt handler
    when we exit the guest due to an external, decrementer or performance
    monitor interrupt. Instead of turning on the MMU, then deciding that
    we need to call the Linux handler and turning the MMU back off again,
    we now go straight to the handler at the point where we would turn the
    MMU on. The handler will then return to the virtual-mode code
    (potentially in the module).

    Along the way, this moves the setting and clearing of the HID5 DCBZ32
    bit into real-mode interrupts-off code, and also makes sure that
    we clear the MSR[RI] bit before loading values into SRR0/1.

    The net result is that we no longer need any code addresses to be
    stored in vcpu->arch.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     

20 Sep, 2011

1 commit

  • OPAL can handle various interrupts for us such as Machine Checks (it
    performs all sorts of recovery tasks and passes back control to us with
    information about the error), Hardware Management Interrupts and Softpatch
    interrupts.

    This wires up the mechanisms and prints out specific information returned
    by HAL when a machine check occurs.

    Signed-off-by: Benjamin Herrenschmidt

    Benjamin Herrenschmidt
     

26 Jul, 2011

1 commit

  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (99 commits)
    drivers/virt: add missing linux/interrupt.h to fsl_hypervisor.c
    powerpc/85xx: fix mpic configuration in CAMP mode
    powerpc: Copy back TIF flags on return from softirq stack
    powerpc/64: Make server perfmon only built on ppc64 server devices
    powerpc/pseries: Fix hvc_vio.c build due to recent changes
    powerpc: Exporting boot_cpuid_phys
    powerpc: Add CFAR to oops output
    hvc_console: Add kdb support
    powerpc/pseries: Fix hvterm_raw_get_chars to accept < 16 chars, fixing xmon
    powerpc/irq: Quieten irq mapping printks
    powerpc: Enable lockup and hung task detectors in pseries and ppc64 defeconfigs
    powerpc: Add mpt2sas driver to pseries and ppc64 defconfig
    powerpc: Disable IRQs off tracer in ppc64 defconfig
    powerpc: Sync pseries and ppc64 defconfigs
    powerpc/pseries/hvconsole: Fix dropped console output
    hvc_console: Improve tty/console put_chars handling
    powerpc/kdump: Fix timeout in crash_kexec_wait_realmode
    powerpc/mm: Fix output of total_ram.
    powerpc/cpufreq: Add cpufreq driver for Momentum Maple boards
    powerpc: Correct annotations of pmu registration functions
    ...

    Fix up trivial Kconfig/Makefile conflicts in arch/powerpc, drivers, and
    drivers/cpufreq

    Linus Torvalds
     

12 Jul, 2011

9 commits

  • This adds support for running KVM guests in supervisor mode on those
    PPC970 processors that have a usable hypervisor mode. Unfortunately,
    Apple G5 machines have supervisor mode disabled (MSR[HV] is forced to
    1), but the YDL PowerStation does have a usable hypervisor mode.

    There are several differences between the PPC970 and POWER7 in how
    guests are managed. These differences are accommodated using the
    CPU_FTR_ARCH_201 (PPC970) and CPU_FTR_ARCH_206 (POWER7) CPU feature
    bits. Notably, on PPC970:

    * The LPCR, LPID or RMOR registers don't exist, and the functions of
    those registers are provided by bits in HID4 and one bit in HID0.

    * External interrupts can be directed to the hypervisor, but unlike
    POWER7 they are masked by MSR[EE] in non-hypervisor modes and use
    SRR0/1 not HSRR0/1.

    * There is no virtual RMA (VRMA) mode; the guest must use an RMO
    (real mode offset) area.

    * The TLB entries are not tagged with the LPID, so it is necessary to
    flush the whole TLB on partition switch. Furthermore, when switching
    partitions we have to ensure that no other CPU is executing the tlbie
    or tlbsync instructions in either the old or the new partition,
    otherwise undefined behaviour can occur.

    * The PMU has 8 counters (PMC registers) rather than 6.

    * The DSCR, PURR, SPURR, AMR, AMOR, UAMOR registers don't exist.

    * The SLB has 64 entries rather than 32.

    * There is no mediated external interrupt facility, so if we switch to
    a guest that has a virtual external interrupt pending but the guest
    has MSR[EE] = 0, we have to arrange to have an interrupt pending for
    it so that we can get control back once it re-enables interrupts. We
    do that by sending ourselves an IPI with smp_send_reschedule after
    hard-disabling interrupts.
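
    In code terms, the differences above are keyed off the CPU feature bits;
    a hedged sketch (kernel context, with flush_whole_tlb() standing in for
    the real tlbie/tlbsync sequence):

    #include <asm/cputable.h>

    static void flush_whole_tlb(void)
    {
        /* placeholder for the tlbie loop the real code performs */
    }

    static void partition_switch_tlb(void)
    {
        if (cpu_has_feature(CPU_FTR_ARCH_201)) {
            /* PPC970: TLB entries are not tagged with the LPID, so the
             * whole TLB must be flushed when switching partitions. */
            flush_whole_tlb();
        }
        /* POWER7 (CPU_FTR_ARCH_206): entries are LPID-tagged, so no
         * global flush is needed here. */
    }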

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This adds infrastructure which will be needed to allow book3s_hv KVM to
    run on older POWER processors, including PPC970, which don't support
    the Virtual Real Mode Area (VRMA) facility, but only the Real Mode
    Offset (RMO) facility. These processors require a physically
    contiguous, aligned area of memory for each guest. When the guest does
    an access in real mode (MMU off), the address is compared against a
    limit value, and if it is lower, the address is ORed with an offset
    value (from the Real Mode Offset Register (RMOR)) and the result becomes
    the real address for the access. The size of the RMA has to be one of
    a set of supported values, which usually includes 64MB, 128MB, 256MB
    and some larger powers of 2.

    Since we are unlikely to be able to allocate 64MB or more of physically
    contiguous memory after the kernel has been running for a while, we
    allocate a pool of RMAs at boot time using the bootmem allocator. The
    size and number of the RMAs can be set using the kvm_rma_size=xx and
    kvm_rma_count=xx kernel command line options.

    KVM exports a new capability, KVM_CAP_PPC_RMA, to signal the availability
    of the pool of preallocated RMAs. The capability value is 1 if the
    processor can use an RMA but doesn't require one (because it supports
    the VRMA facility), or 2 if the processor requires an RMA for each guest.

    This adds a new ioctl, KVM_ALLOCATE_RMA, which allocates an RMA from the
    pool and returns a file descriptor which can be used to map the RMA. It
    also returns the size of the RMA in the argument structure.
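
    A hedged userspace sketch of using the new ioctl (the ioctl name and
    behaviour are as described above; the argument structure, assumed here to
    be struct kvm_allocate_rma with an rma_size field, should be checked
    against the kernel headers):

    #include <stddef.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <linux/kvm.h>

    static void *map_rma(int vm_fd, unsigned long *size_out)
    {
        struct kvm_allocate_rma rma = { 0 };
        int rma_fd = ioctl(vm_fd, KVM_ALLOCATE_RMA, &rma);  /* returns an fd */

        if (rma_fd < 0)
            return NULL;
        *size_out = rma.rma_size;            /* filled in by the kernel */
        return mmap(NULL, rma.rma_size, PROT_READ | PROT_WRITE,
                    MAP_SHARED, rma_fd, 0);
    }

    The pool this draws from is sized at boot with the kvm_rma_size= and
    kvm_rma_count= options mentioned above.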

    Having an RMA means we will get multiple KVM_SET_USER_MEMORY_REGION
    ioctl calls from userspace. To cope with this, we now preallocate the
    kvm->arch.ram_pginfo array when the VM is created with a size sufficient
    for up to 64GB of guest memory. Subsequently we will get rid of this
    array and use memory associated with each memslot instead.

    This moves most of the code that translates the user addresses into
    host pfns (page frame numbers) out of kvmppc_prepare_vrma up one level
    to kvmppc_core_prepare_memory_region. Also, instead of having to look
    up the VMA for each page in order to check the page size, we now check
    that the pages we get are compound pages of 16MB. However, if we are
    adding memory that is mapped to an RMA, we don't bother with calling
    get_user_pages_fast and instead just offset from the base pfn for the
    RMA.

    Typically the RMA gets added after vcpus are created, which makes it
    inconvenient to have the LPCR (logical partition control register) value
    in the vcpu->arch struct, since the LPCR controls whether the processor
    uses RMA or VRMA for the guest. This moves the LPCR value into the
    kvm->arch struct and arranges for the MER (mediated external request)
    bit, which is the only bit that varies between vcpus, to be set in
    assembly code when going into the guest if there is a pending external
    interrupt request.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This lifts the restriction that book3s_hv guests can only run one
    hardware thread per core, and allows them to use up to 4 threads
    per core on POWER7. The host still has to run single-threaded.

    This capability is advertised to qemu through a new KVM_CAP_PPC_SMT
    capability. The return value of the ioctl querying this capability
    is the number of vcpus per virtual CPU core (vcore), currently 4.

    To use this, the host kernel should be booted with all threads
    active, and then all the secondary threads should be offlined.
    This will put the secondary threads into nap mode. KVM will then
    wake them from nap mode and use them for running guest code (while
    they are still offline). To wake the secondary threads, we send
    them an IPI using a new xics_wake_cpu() function, implemented in
    arch/powerpc/sysdev/xics/icp-native.c. In other words, at this stage
    we assume that the platform has a XICS interrupt controller and
    we are using icp-native.c to drive it. Since the woken thread will
    need to acknowledge and clear the IPI, we also export the base
    physical address of the XICS registers using kvmppc_set_xics_phys()
    for use in the low-level KVM book3s code.

    When a vcpu is created, it is assigned to a virtual CPU core.
    The vcore number is obtained by dividing the vcpu number by the
    number of threads per core in the host. This number is exported
    to userspace via the KVM_CAP_PPC_SMT capability. If qemu wishes
    to run the guest in single-threaded mode, it should make all vcpu
    numbers be multiples of the number of threads per core.
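
    A hedged userspace sketch of how qemu could honour this
    (KVM_CHECK_EXTENSION and KVM_CREATE_VCPU are the standard KVM ioctls;
    error handling is trimmed and the returned vcpu fds would be kept
    around in a real VMM):

    #include <sys/ioctl.h>
    #include <linux/kvm.h>

    static int create_single_threaded_vcpus(int kvm_fd, int vm_fd, int nr_vcpus)
    {
        int i;
        int threads = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_PPC_SMT);

        if (threads <= 0)
            threads = 1;        /* capability absent: no vcore splitting */
        for (i = 0; i < nr_vcpus; i++) {
            /* vcpu ids that are multiples of 'threads' put one vcpu in
             * each vcore, i.e. the guest runs single-threaded */
            int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU,
                                (unsigned long)(i * threads));

            if (vcpu_fd < 0)
                return -1;
        }
        return 0;
    }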

    We distinguish three states of a vcpu: runnable (i.e., ready to execute
    the guest), blocked (that is, idle), and busy in host. We currently
    implement a policy that the vcore can run only when all its threads
    are runnable or blocked. This way, if a vcpu needs to execute elsewhere
    in the kernel or in qemu, it can do so without being starved of CPU
    by the other vcpus.

    When a vcore starts to run, it executes in the context of one of the
    vcpu threads. The other vcpu threads all go to sleep and stay asleep
    until something happens requiring the vcpu thread to return to qemu,
    or to wake up to run the vcore (this can happen when another vcpu
    thread goes from busy in host state to blocked).

    It can happen that a vcpu goes from blocked to runnable state (e.g.
    because of an interrupt), and the vcore it belongs to is already
    running. In that case it can start to run immediately as long as
    none of the vcpus in the vcore have started to exit the guest.
    We send the next free thread in the vcore an IPI to get it to start
    to execute the guest. It synchronizes with the other threads via
    the vcore->entry_exit_count field to make sure that it doesn't go
    into the guest if the other vcpus are exiting by the time that it
    is ready to actually enter the guest.

    Note that there is no fixed relationship between the hardware thread
    number and the vcpu number. Hardware threads are assigned to vcpus
    as they become runnable, so we will always use the lower-numbered
    hardware threads in preference to higher-numbered threads if not all
    the vcpus in the vcore are runnable, regardless of which vcpus are
    runnable.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This adds the infrastructure for handling PAPR hcalls in the kernel,
    either early in the guest exit path while we are still in real mode,
    or later once the MMU has been turned back on and we are in the full
    kernel context. The advantage of handling hcalls in real mode if
    possible is that we avoid two partition switches -- and this will
    become more important when we support SMT4 guests, since a partition
    switch means we have to pull all of the threads in the core out of
    the guest. The disadvantage is that we can only access the kernel
    linear mapping, not anything vmalloced or ioremapped, since the MMU
    is off.

    This also adds code to handle the following hcalls in real mode:

    H_ENTER Add an HPTE to the hashed page table
    H_REMOVE Remove an HPTE from the hashed page table
    H_READ Read HPTEs from the hashed page table
    H_PROTECT Change the protection bits in an HPTE
    H_BULK_REMOVE Remove up to 4 HPTEs from the hashed page table
    H_SET_DABR Set the data address breakpoint register

    Plus code to handle the following hcalls in the kernel:

    H_CEDE Idle the vcpu until an interrupt or H_PROD hcall arrives
    H_PROD Wake up a ceded vcpu
    H_REGISTER_VPA Register a virtual processor area (VPA)

    The code that runs in real mode has to be in the base kernel, not in
    the module, if KVM is compiled as a module. The real-mode code can
    only access the kernel linear mapping, not vmalloc or ioremap space.
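
    A hedged sketch of the resulting split, using the standard PAPR hcall
    numbers from <asm/hvcall.h> (the dispatch itself is illustrative):

    #include <asm/hvcall.h>

    static int handled_in_real_mode(unsigned long hcall_nr)
    {
        switch (hcall_nr) {
        case H_ENTER:
        case H_REMOVE:
        case H_READ:
        case H_PROTECT:
        case H_BULK_REMOVE:
        case H_SET_DABR:
            return 1;   /* only touches the kernel linear mapping */
        case H_CEDE:
        case H_PROD:
        case H_REGISTER_VPA:
        default:
            return 0;   /* needs full kernel context (MMU on) */
        }
    }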

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This adds support for KVM running on 64-bit Book 3S processors,
    specifically POWER7, in hypervisor mode. Using hypervisor mode means
    that the guest can use the processor's supervisor mode. That means
    that the guest can execute privileged instructions and access privileged
    registers itself without trapping to the host. This gives excellent
    performance, but does mean that KVM cannot emulate a processor
    architecture other than the one that the hardware implements.

    This code assumes that the guest is running paravirtualized using the
    PAPR (Power Architecture Platform Requirements) interface, which is the
    interface that IBM's PowerVM hypervisor uses. That means that existing
    Linux distributions that run on IBM pSeries machines will also run
    under KVM without modification. In order to communicate the PAPR
    hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
    to include/linux/kvm.h.
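
    A hedged sketch of the userspace side of that exit (the papr_hcall field
    layout -- nr, args[], ret -- is the editor's assumption and should be
    checked against include/linux/kvm.h; emulate_hcall() is hypothetical):

    #include <linux/kvm.h>

    /* hypothetical VMM-side PAPR hcall emulation */
    extern __u64 emulate_hcall(__u64 nr, __u64 *args);

    static void handle_papr_exit(struct kvm_run *run)
    {
        if (run->exit_reason == KVM_EXIT_PAPR_HCALL) {
            run->papr_hcall.ret = emulate_hcall(run->papr_hcall.nr,
                                                run->papr_hcall.args);
            /* the return value reaches the guest on the next KVM_RUN */
        }
    }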

    Currently the choice between book3s_hv support and book3s_pr support
    (i.e. the existing code, which runs the guest in user mode) has to be
    made at kernel configuration time, so a given kernel binary can only
    do one or the other.

    This new book3s_hv code doesn't support MMIO emulation at present.
    Since we are running paravirtualized guests, this isn't a serious
    restriction.

    With the guest running in supervisor mode, most exceptions go straight
    to the guest. We will never get data or instruction storage or segment
    interrupts, alignment interrupts, decrementer interrupts, program
    interrupts, single-step interrupts, etc., coming to the hypervisor from
    the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
    exception entry path so that we don't have to do the KVM test on entry
    to those exception handlers.

    We do however get hypervisor decrementer, hypervisor data storage,
    hypervisor instruction storage, and hypervisor emulation assist
    interrupts, so we have to handle those.

    In hypervisor mode, real-mode accesses can access all of RAM, not just
    a limited amount. Therefore we put all the guest state in the vcpu.arch
    and use the shadow_vcpu in the PACA only for temporary scratch space.
    We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
    anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
    We don't have a shared page with the guest, but we still need a
    kvm_vcpu_arch_shared struct to store the values of various registers,
    so we include one in the vcpu_arch struct.

    The POWER7 processor has a restriction that all threads in a core have
    to be in the same partition. MMU-on kernel code counts as a partition
    (partition 0), so we have to do a partition switch on every entry to and
    exit from the guest. At present we require the host and guest to run
    in single-thread mode because of this hardware restriction.

    This code allocates a hashed page table for the guest and initializes
    it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
    require that the guest memory is allocated using 16MB huge pages, in
    order to simplify the low-level memory management. This also means that
    we can get away without tracking paging activity in the host for now,
    since huge pages can't be paged or swapped.

    This also adds a few new exports needed by the book3s_hv code.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • There are several fields in struct kvmppc_book3s_shadow_vcpu that
    temporarily store bits of host state while a guest is running,
    rather than anything relating to the particular guest or vcpu.
    This splits them out into a new kvmppc_host_state structure and
    modifies the definitions in asm-offsets.c to suit.

    On 32-bit, we have a kvmppc_host_state structure inside the
    kvmppc_book3s_shadow_vcpu since the assembly code needs to be able
    to get to them both with one pointer. On 64-bit they are separate
    fields in the PACA. This means that on 64-bit we don't need to
    copy the kvmppc_host_state in and out on vcpu load/unload, and
    in future will mean that the book3s_hv code doesn't need a
    shadow_vcpu struct in the PACA at all. That does mean that we
    have to be careful not to rely on any values persisting in the
    hstate field of the paca across any point where we could block
    or get preempted.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • Dynamically assign host PIDs to guest PIDs, splitting each guest PID into
    multiple host (shadow) PIDs based on kernel/user and MSR[IS/DS]. Use
    both PID0 and PID1 so that the shadow PIDs for the right mode can be
    selected, that correspond both to guest TID = zero and guest TID = guest
    PID.

    This allows us to significantly reduce the frequency of needing to
    invalidate the entire TLB. When the guest mode or PID changes, we just
    update the host PID0/PID1. And since the allocation of shadow PIDs is
    global, multiple guests can share the TLB without conflict.

    Note that KVM does not yet support the guest setting PID1 or PID2 to
    a value other than zero. This will need to be fixed for nested KVM
    to work. Until then, we enforce the requirement for guest PID1/PID2
    to stay zero by failing the emulation if the guest tries to set them
    to something else.
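
    A hedged sketch of the mapping (array shape and names are illustrative,
    not the kernel's data structures):

    #define NR_PRIV 2   /* guest kernel vs. guest user */
    #define NR_AS   2   /* MSR[IS]/[DS] address space 0 or 1 */

    struct shadow_pid_map {
        int tid0_pid[NR_PRIV][NR_AS];   /* guest mappings with TID = 0 */
        int tidN_pid[NR_PRIV][NR_AS];   /* guest mappings with TID = guest PID */
    };

    static void write_host_pid0(int pid) { (void)pid; /* mtspr(SPRN_PID0, pid) */ }
    static void write_host_pid1(int pid) { (void)pid; /* mtspr(SPRN_PID1, pid) */ }

    /* On a guest PID or mode change, only the host PID registers are
     * reloaded; no TLB invalidation is required. */
    static void switch_guest_context(const struct shadow_pid_map *m, int priv, int as)
    {
        write_host_pid0(m->tid0_pid[priv][as]);
        write_host_pid1(m->tidN_pid[priv][as]);
    }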

    Signed-off-by: Liu Yu
    Signed-off-by: Scott Wood
    Signed-off-by: Alexander Graf

    Liu Yu
     
  • This is done lazily. The SPE save will be done only if the guest has
    used SPE since the last preemption or heavyweight exit. Restore will be
    done only on demand, when enabling MSR_SPE in the shadow MSR, in response
    to an SPE fault or mtmsr emulation.

    For SPEFSCR, Linux already switches it on context switch (non-lazily), so
    the only remaining bit is to save it between qemu and the guest.

    Signed-off-by: Liu Yu
    Signed-off-by: Scott Wood
    Signed-off-by: Alexander Graf

    Scott Wood
     
  • Keep the guest MSR and the guest-mode true MSR separate, rather than
    modifying the guest MSR on each guest entry to produce a true MSR.

    Any bits which should be modified based on guest MSR must be explicitly
    propagated from vcpu->arch.shared->msr to vcpu->arch.shadow_msr in
    kvmppc_set_msr().

    While we're modifying the guest entry code, reorder a few instructions
    to bury some load latencies.
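
    A hedged sketch of the split (field names are illustrative; the actual
    set of bits that propagates is whatever kvmppc_set_msr() chooses):

    struct msr_pair {
        unsigned long shared_msr;   /* vcpu->arch.shared->msr: what the guest sees */
        unsigned long shadow_msr;   /* vcpu->arch.shadow_msr: the true MSR */
    };

    static void set_guest_msr(struct msr_pair *m, unsigned long new_msr,
                              unsigned long propagate_mask)
    {
        m->shared_msr = new_msr;
        /* only explicitly chosen bits flow into the true MSR */
        m->shadow_msr = (m->shadow_msr & ~propagate_mask)
                      | (new_msr & propagate_mask);
    }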

    Signed-off-by: Scott Wood
    Signed-off-by: Alexander Graf

    Scott Wood
     

23 Jun, 2011

1 commit

  • We expect this is actually faster, and we end up needing more space than we
    can get from the SPRGs in some instances. This is also useful when running
    as a guest OS - SPRGs 4-7 do not have guest versions.

    8 slots are allocated in thread_info for this even though we only actually
    use 4 of them - this allows space for future code to have more scratch
    space (and we know we'll need it for things like hugetlb).

    Signed-off-by: Ashish Kalra
    Signed-off-by: Becky Bruce
    Signed-off-by: Kumar Gala

    Ashish Kalra
     

23 May, 2011

1 commit

  • * 'kvm-updates/2.6.40' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (131 commits)
    KVM: MMU: Use ptep_user for cmpxchg_gpte()
    KVM: Fix kvm mmu_notifier initialization order
    KVM: Add documentation for KVM_CAP_NR_VCPUS
    KVM: make guest mode entry to be rcu quiescent state
    KVM: x86 emulator: Make jmp far emulation into a separate function
    KVM: x86 emulator: Rename emulate_grpX() to em_grpX()
    KVM: x86 emulator: Remove unused arg from emulate_pop()
    KVM: x86 emulator: Remove unused arg from writeback()
    KVM: x86 emulator: Remove unused arg from read_descriptor()
    KVM: x86 emulator: Remove unused arg from seg_override()
    KVM: Validate userspace_addr of memslot when registered
    KVM: MMU: Clean up gpte reading with copy_from_user()
    KVM: PPC: booke: add sregs support
    KVM: PPC: booke: save/restore VRSAVE (a.k.a. USPRG0)
    KVM: PPC: use ticks, not usecs, for exit timing
    KVM: PPC: fix exit accounting for SPRs, tlbwe, tlbsx
    KVM: PPC: e500: emulate SVR
    KVM: VMX: Cache vmcs segment fields
    KVM: x86 emulator: consolidate segment accessors
    KVM: VMX: Avoid reading %rip unnecessarily when handling exceptions
    ...

    Linus Torvalds
     

22 May, 2011

1 commit


27 Apr, 2011

1 commit

  • The DSCR (aka Data Stream Control Register) is supported on some
    server PowerPC chips and allows some control over the prefetch
    of data streams.

    This patch allows the value to be specified per thread by emulating
    the corresponding mfspr and mtspr instructions. Children of such
    threads inherit the value. Other threads use a default value that
    can be specified in sysfs - /sys/devices/system/cpu/dscr_default.

    If a thread starts with a non-default value in the sysfs entry, all
    child threads inherit this non-default value even if the sysfs value
    is changed later.
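
    A hedged sketch of the resulting per-thread selection (field names are
    illustrative, not the exact thread_struct members):

    struct dscr_thread_sketch {
        unsigned long dscr;     /* value set via the emulated mtspr       */
        int dscr_inherited;     /* this thread (or an ancestor) set it    */
    };

    static unsigned long effective_dscr(const struct dscr_thread_sketch *t,
                                        unsigned long sysfs_default)
    {
        /* threads that never wrote the SPR track the sysfs default */
        return t->dscr_inherited ? t->dscr : sysfs_default;
    }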

    Signed-off-by: Alexey Kardashevskiy
    Signed-off-by: Benjamin Herrenschmidt

    Alexey Kardashevskiy
     

29 Nov, 2010

1 commit


25 Oct, 2010

1 commit

  • * 'kvm-updates/2.6.37' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (321 commits)
    KVM: Drop CONFIG_DMAR dependency around kvm_iommu_map_pages
    KVM: Fix signature of kvm_iommu_map_pages stub
    KVM: MCE: Send SRAR SIGBUS directly
    KVM: MCE: Add MCG_SER_P into KVM_MCE_CAP_SUPPORTED
    KVM: fix typo in copyright notice
    KVM: Disable interrupts around get_kernel_ns()
    KVM: MMU: Avoid sign extension in mmu_alloc_direct_roots() pae root address
    KVM: MMU: move access code parsing to FNAME(walk_addr) function
    KVM: MMU: audit: check whether have unsync sps after root sync
    KVM: MMU: audit: introduce audit_printk to cleanup audit code
    KVM: MMU: audit: unregister audit tracepoints before module unloaded
    KVM: MMU: audit: fix vcpu's spte walking
    KVM: MMU: set access bit for direct mapping
    KVM: MMU: cleanup for error mask set while walk guest page table
    KVM: MMU: update 'root_hpa' out of loop in PAE shadow path
    KVM: x86 emulator: Eliminate compilation warning in x86_decode_insn()
    KVM: x86: Fix constant type in kvm_get_time_scale
    KVM: VMX: Add AX to list of registers clobbered by guest switch
    KVM guest: Move a printk that's using the clock before it's ready
    KVM: x86: TSC catchup mode
    ...

    Linus Torvalds
     

24 Oct, 2010

5 commits

  • This is the guest side of the mtsr acceleration. Using this a guest can now
    call mtsrin with almost no overhead as long as it ensures that it only uses
    it with (MSR_IR|MSR_DR) == 0. Linux does that, so we're good.

    Signed-off-by: Alexander Graf

    Alexander Graf
     
  • When CONFIG_KVM_GUEST is selected, but CONFIG_KVM is not, we were missing
    some defines in asm-offsets.c and included too many headers in other places.

    This patch makes the above configuration work.

    Reported-by: Stephen Rothwell
    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • We have all the hypervisor pieces in place now, but the guest parts are still
    missing.

    This patch implements basic awareness of KVM when running Linux as guest. It
    doesn't do anything with it yet though.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • One of the most obvious registers to share with the guest directly is the
    MSR. The MSR contains the "interrupts enabled" flag which the guest has to
    toggle in critical sections.

    So in order to reduce the overhead of enabling and disabling interrupts,
    let's put the MSR into the shared page. Keep in mind that even though you can fully read
    its contents, writing to it doesn't always update all state. There are a few
    safe fields that don't require hypervisor interaction. See the documentation
    for a list of MSR bits that are safe to be set from inside the guest.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • For transparent variable sharing between the hypervisor and guest, I introduce
    a shared page. This shared page will contain all the registers the guest can
    read and write safely without exiting guest context.

    This patch only implements the stubs required for the basic structure of the
    shared page. The actual register moving follows.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     

14 Oct, 2010

1 commit

  • Freescale parts typically have a TLB array for large mappings that we can
    bolt the linear mapping into. We utilize the code that already exists
    on PPC32 on the 64-bit side to set up the linear mapping to be covered by
    bolted TLB entries. We utilize a quarter of the variable-size TLB array
    for this purpose.

    Additionally, we limit the amount of memory to what we can cover via
    bolted entries so we don't get secondary faults in the TLB miss
    handlers. We should fix this limitation in the future.

    Signed-off-by: Kumar Gala

    Kumar Gala
     

02 Sep, 2010

1 commit

  • Currently, when CONFIG_VIRT_CPU_ACCOUNTING is enabled, we use the
    PURR register for measuring the user and system time used by
    processes, as well as other related times such as hardirq and
    softirq times. This turns out to be quite confusing for users
    because it means that a program will often be measured as taking
    less time when run on a multi-threaded processor (SMT2 or SMT4 mode)
    than it does when run on a single-threaded processor (ST mode), even
    though the program takes longer to finish. The discrepancy is
    accounted for as stolen time, which is also confusing, particularly
    when there are no other partitions running.

    This changes the accounting to use the timebase instead, meaning that
    the reported user and system times are the actual number of real-time
    seconds that the program was executing on the processor thread,
    regardless of which SMT mode the processor is in. Thus a program will
    generally show greater user and system times when run on a
    multi-threaded processor than on a single-threaded processor.

    On pSeries systems on POWER5 or later processors, we measure the
    stolen time (time when this partition wasn't running) using the
    hypervisor dispatch trace log. We check for new entries in the
    log on every entry from user mode and on every transition from
    kernel process context to soft or hard IRQ context (i.e. when
    account_system_vtime() gets called). So that we can correctly
    distinguish time stolen from user time and time stolen from system
    time, without having to check the log on every exit to user mode,
    we store separate timestamps for exit to user mode and entry from
    user mode.

    On systems that have a SPURR (POWER6 and POWER7), we read the SPURR
    in account_system_vtime() (as before), and then apportion the SPURR
    ticks since the last time we read it between scaled user time and
    scaled system time according to the relative proportions of user
    time and system time over the same interval. This avoids having to
    read the SPURR on every kernel entry and exit. On systems that have
    PURR but not SPURR (i.e., POWER5), we do the same using the PURR
    rather than the SPURR.
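
    A hedged sketch of the apportioning arithmetic (names illustrative):

    struct vtime_deltas {
        unsigned long user_tb;    /* timebase ticks accounted as user time   */
        unsigned long sys_tb;     /* timebase ticks accounted as system time */
        unsigned long spurr;      /* SPURR (or PURR) ticks since last read   */
    };

    static void apportion_scaled_time(const struct vtime_deltas *d,
                                      unsigned long *scaled_user,
                                      unsigned long *scaled_sys)
    {
        unsigned long total = d->user_tb + d->sys_tb;

        if (!total)
            return;
        *scaled_user = (unsigned long)
            (((unsigned long long)d->spurr * d->user_tb) / total);
        *scaled_sys = d->spurr - *scaled_user;
    }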

    This disables the DTL user interface in /sys/debug/kernel/powerpc/dtl
    for now since it conflicts with the use of the dispatch trace log
    by the time accounting code.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Paul Mackerras
     

09 Jul, 2010

2 commits

  • Now that we dynamically allocate the paca array, it takes an extra load
    whenever we want to access another cpu's paca. One place we do that a lot
    is per cpu variables. A simple example:

    DEFINE_PER_CPU(unsigned long, vara);

    unsigned long test4(int cpu)
    {
            return per_cpu(vara, cpu);
    }

    This takes 4 loads, 5 if you include the actual load of the per cpu variable:

    ld r11,-32760(r30) # load address of paca pointer
    ld r9,-32768(r30) # load link address of percpu variable
    sldi r3,r29,9 # get offset into paca (each entry is 512 bytes)
    ld r0,0(r11) # load paca pointer
    add r3,r0,r3 # paca + offset
    ld r11,64(r3) # load paca[cpu].data_offset

    ldx r3,r9,r11 # load per cpu variable

    If we remove the ppc64 specific per_cpu_offset(), we get the generic one
    which indexes into a statically allocated array. This removes one load and
    one add:

    ld r11,-32760(r30) # load address of __per_cpu_offset
    ld r9,-32768(r30) # load link address of percpu variable
    sldi r3,r29,3 # get offset into __per_cpu_offset (each entry 8 bytes)
    ldx r11,r11,r3 # load __per_cpu_offset[cpu]

    ldx r3,r9,r11 # load per cpu variable

    Having all the offsets in one array also helps when iterating over a per cpu
    variable across a number of cpus, such as in the scheduler. Before we would
    need to load one paca cacheline when calculating each per cpu offset. Now we
    have 16 (128 / sizeof(long)) per cpu offsets in each cacheline.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Benjamin Herrenschmidt

    Anton Blanchard
     
  • Currently it is possible for userspace to see the result of
    gettimeofday() going backwards by 1 microsecond, assuming that
    userspace is using the gettimeofday() in the VDSO. The VDSO
    gettimeofday() algorithm computes the time in "xsecs", which are
    units of 2^-20 seconds, or approximately 0.954 microseconds,
    using the algorithm

    now = (timebase - tb_orig_stamp) * tb_to_xs + stamp_xsec

    and then converts the time in xsecs to seconds and microseconds.

    The kernel updates the tb_orig_stamp and stamp_xsec values every
    tick in update_vsyscall(). If the length of the tick is not an
    integer number of xsecs, then some precision is lost in converting
    the current time to xsecs. For example, with CONFIG_HZ=1000, the
    tick is 1ms long, which is 1048.576 xsecs. That means that
    stamp_xsec will advance by either 1048 or 1049 on each tick.
    With the right conditions, it is possible for userspace to get
    (timebase - tb_orig_stamp) * tb_to_xs being 1049 if the kernel is
    slightly late in updating the vdso_datapage, and then for stamp_xsec
    to advance by 1048 when the kernel does update it, and for userspace
    to then see (timebase - tb_orig_stamp) * tb_to_xs being zero due to
    integer truncation. The result is that time appears to go backwards
    by 1 microsecond.

    To fix this we change the VDSO gettimeofday to use a new field in the
    VDSO datapage which stores the nanoseconds part of the time as a
    fractional number of seconds in a 0.32 binary fraction format.
    (Or put another way, as a 32-bit number in units of 0.23283 ns.)
    This is convenient because we can use the mulhwu instruction to
    convert it to either microseconds or nanoseconds.

    Since it turns out that computing the time of day using this new field
    is simpler than either using stamp_xsec (as gettimeofday does) or
    stamp_xtime.tv_nsec (as clock_gettime does), this converts both
    gettimeofday and clock_gettime to use the new field. The existing
    __do_get_tspec function is converted to use the new field and take
    a parameter in r7 that indicates the desired resolution, 1,000,000
    for microseconds or 1,000,000,000 for nanoseconds. The __do_get_xsec
    function is then unused and is deleted.

    The new algorithm is

    now = ((timebase - tb_orig_stamp) << 12) * tb_to_xs
    + (stamp_xtime_seconds << 32) + stamp_sec_fraction

    with 'now' in units of 2^-32 seconds. That is then converted to
    seconds and either microseconds or nanoseconds with

    seconds = now >> 32
    partseconds = ((now & 0xffffffff) * resolution) >> 32
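
    A hedged C model of this arithmetic (the real code is VDSO assembly using
    mulhwu; the multiply by the 0.64-format tb_to_xs is modelled here as
    taking the high 64 bits of the 128-bit product):

    #include <stdint.h>

    struct vdso_snapshot {
        uint64_t tb_orig_stamp;        /* timebase at last update      */
        uint64_t tb_to_xs;             /* 0.64 binary fraction         */
        uint64_t stamp_xtime_sec;      /* seconds at last update       */
        uint32_t stamp_sec_fraction;   /* 0.32 fraction of a second    */
    };

    static void tb_to_time(const struct vdso_snapshot *v, uint64_t timebase,
                           uint32_t resolution,   /* 1000000 or 1000000000 */
                           uint64_t *sec, uint32_t *part)
    {
        uint64_t delta = (timebase - v->tb_orig_stamp) << 12;
        /* 'now' is in units of 2^-32 seconds */
        uint64_t now = (uint64_t)(((unsigned __int128)delta * v->tb_to_xs) >> 64)
                     + (v->stamp_xtime_sec << 32) + v->stamp_sec_fraction;

        *sec  = now >> 32;
        *part = (uint32_t)(((now & 0xffffffffULL) * resolution) >> 32);
    }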

    The 32-bit VDSO code also makes a further simplification: it ignores
    the bottom 32 bits of the tb_to_xs value, which is a 0.64 format binary
    fraction. Doing so gets rid of 4 multiply instructions. Assuming
    a timebase frequency of 1GHz or less and an update interval of no
    more than 10ms, the upper 32 bits of tb_to_xs will be at least
    4503599, so the error from ignoring the low 32 bits will be at most
    2.2ns, which is more than an order of magnitude less than the time
    taken to do gettimeofday or clock_gettime on our fastest processors,
    so there is no possibility of seeing inconsistent values due to this.

    This also moves update_gtod() down next to its only caller, and makes
    update_vsyscall use the time passed in via the wall_time argument rather
    than accessing xtime directly. At present, wall_time always points to
    xtime, but that could change in future.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Benjamin Herrenschmidt

    Paul Mackerras
     

22 May, 2010

2 commits

  • * 'kvm-updates/2.6.35' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (269 commits)
    KVM: x86: Add missing locking to arch specific vcpu ioctls
    KVM: PPC: Add missing vcpu_load()/vcpu_put() in vcpu ioctls
    KVM: MMU: Segregate shadow pages with different cr0.wp
    KVM: x86: Check LMA bit before set_efer
    KVM: Don't allow lmsw to clear cr0.pe
    KVM: Add cpuid.txt file
    KVM: x86: Tell the guest we'll warn it about tsc stability
    x86, paravirt: don't compute pvclock adjustments if we trust the tsc
    x86: KVM guest: Try using new kvm clock msrs
    KVM: x86: export paravirtual cpuid flags in KVM_GET_SUPPORTED_CPUID
    KVM: x86: add new KVMCLOCK cpuid feature
    KVM: x86: change msr numbers for kvmclock
    x86, paravirt: Add a global synchronization point for pvclock
    x86, paravirt: Enable pvclock flags in vcpu_time_info structure
    KVM: x86: Inject #GP with the right rip on efer writes
    KVM: SVM: Don't allow nested guest to VMMCALL into host
    KVM: x86: Fix exception reinjection forced to true
    KVM: Fix wallclock version writing race
    KVM: MMU: Don't read pdptrs with mmu spinlock held in mmu_alloc_roots
    KVM: VMX: enable VMXON check with SMX enabled (Intel TXT)
    ...

    Linus Torvalds
     
  • * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (92 commits)
    powerpc: Remove unused 'protect4gb' boot parameter
    powerpc: Build-in e1000e for pseries & ppc64_defconfig
    powerpc/pseries: Make request_ras_irqs() available to other pseries code
    powerpc/numa: Use ibm,architecture-vec-5 to detect form 1 affinity
    powerpc/numa: Set a smaller value for RECLAIM_DISTANCE to enable zone reclaim
    powerpc: Use smt_snooze_delay=-1 to always busy loop
    powerpc: Remove check of ibm,smt-snooze-delay OF property
    powerpc/kdump: Fix race in kdump shutdown
    powerpc/kexec: Fix race in kexec shutdown
    powerpc/kexec: Speedup kexec hash PTE tear down
    powerpc/pseries: Add hcall to read 4 ptes at a time in real mode
    powerpc: Use more accurate limit for first segment memory allocations
    powerpc/kdump: Use chip->shutdown to disable IRQs
    powerpc/kdump: CPUs assume the context of the oopsing CPU
    powerpc/crashdump: Do not fail on NULL pointer dereferencing
    powerpc/eeh: Fix oops when probing in early boot
    powerpc/pci: Check devices status property when scanning OF tree
    powerpc/vio: Switch VIO Bus PM to use generic helpers
    powerpc: Avoid bad relocations in iSeries code
    powerpc: Use common cpu_die (fixes SMP+SUSPEND build)
    ...

    Linus Torvalds
     

21 May, 2010

1 commit

  • In kexec_prepare_cpus, the primary CPU IPIs the secondary CPUs to
    kexec_smp_down(). kexec_smp_down() calls kexec_smp_wait() which sets
    the hw_cpu_id() to -1. The primary does this while leaving IRQs on
    which means the primary can take a timer interrupt which can lead to
    the IPIing of one of the secondary CPUs (say, for a scheduler re-balance)
    but since the secondary CPU now has a hw_cpu_id = -1, we IPI CPU
    -1... Kaboom!

    We are hitting this case regularly on POWER7 machines.

    There is also a second race, where the primary will tear down the MMU
    mappings before knowing the secondaries have entered real mode.

    Also, the secondaries are clearing out any pending IPIs before
    guaranteeing that no more will be received.

    This changes kexec_prepare_cpus() so that we turn off IRQs in the
    primary CPU much earlier. It adds a paca flag to say that the
    secondaries have entered the kexec_smp_down() IPI and turned off IRQs,
    rather than overloading hw_cpu_id with -1. This new paca flag is
    again used to indicate when the secondaries have entered real mode.

    It also ensures that all CPUs have their IRQs off before we clear out
    any pending IPI requests (in kexec_cpu_down()) to ensure there are no
    trailing IPIs left unacknowledged.
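
    A hedged sketch of the handshake this sets up (the flag values and names
    are illustrative):

    enum kexec_stage { KEXEC_RUNNING, KEXEC_IRQS_OFF, KEXEC_REAL_MODE };

    struct kexec_paca_sketch {
        volatile enum kexec_stage stage;
    };

    /* secondary CPU, entered from the kexec_smp_down() IPI */
    static void secondary_kexec_down(struct kexec_paca_sketch *my_paca)
    {
        local_irq_disable();
        my_paca->stage = KEXEC_IRQS_OFF;   /* primary may now clear pending IPIs */
        /* ... drop into real mode ... */
        my_paca->stage = KEXEC_REAL_MODE;  /* primary may now tear down the MMU */
    }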

    Signed-off-by: Michael Neuling
    Signed-off-by: Benjamin Herrenschmidt

    Michael Neuling
     

17 May, 2010

5 commits

  • When we build with ftrace enabled it's possible that loadcam_entry would
    have used the stack pointer (even though the code doesn't need it). We
    call loadcam_entry in __secondary_start before the stack is set up. To
    ensure that loadcam_entry doesn't use the stack pointer, the easiest
    solution is to just have it in asm code.

    Signed-off-by: Kumar Gala

    Kumar Gala
     
  • We need the SWITCH_FRAME_SIZE define on Book3S_32 now too.
    So let's export it unconditionally.

    CC: Benjamin Herrenschmidt
    Signed-off-by: Alexander Graf
    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • We need to keep the pointer to the shadow vcpu somewhere accessible from
    within really early interrupt code. The best fit I found was the thread
    struct, as that resides in an SPRG.

    So let's put a pointer to the shadow vcpu in the thread struct and add
    an asm-offset so we can find it.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • The shadow vcpu now contains some fields we don't use from the vcpu anymore.
    Access to them happens using inline functions that happily use the shadow
    vcpu fields.

    So let's now ifdef them out to booke only and add asm-offsets.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • Upstream recently added a new name for PPC64: Book3S_64.

    So instead of using CONFIG_PPC64 we should use CONFIG_PPC_BOOK3S consistently.
    That makes understanding the code easier (I hope).

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     

12 May, 2010

1 commit

  • Anton Blanchard found that large POWER systems would occasionally
    crash in the exception exit path when profiling with perf_events.
    The symptom was that an interrupt would occur late in the exit path
    when the MSR[RI] (recoverable interrupt) bit was clear. Interrupts
    should be hard-disabled at this point but they were enabled. Because
    the interrupt was not recoverable the system panicked.

    The reason is that the exception exit path was calling
    perf_event_do_pending after hard-disabling interrupts, and
    perf_event_do_pending will re-enable interrupts.

    The simplest and cleanest fix for this is to use the same mechanism
    that 32-bit powerpc does, namely to cause a self-IPI by setting the
    decrementer to 1. This means we can remove the tests in the exception
    exit path and raw_local_irq_restore.

    This also makes sure that the call to perf_event_do_pending from
    timer_interrupt() happens within irq_enter/irq_exit. (Note that
    calling perf_event_do_pending from timer_interrupt does not mean that
    there is a possible 1/HZ latency; setting the decrementer to 1 ensures
    that the timer interrupt will happen immediately, i.e. within one
    timebase tick, which is a few nanoseconds or 10s of nanoseconds.)
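
    A hedged sketch of the mechanism (set_dec() is the existing powerpc helper
    that writes the decrementer):

    /* Instead of calling perf_event_do_pending() directly on the exception
     * exit path (which would re-enable interrupts), arrange for a timer
     * interrupt to fire almost immediately: */
    static void kick_pending_perf_work(void)
    {
        set_dec(1);   /* decrementer expires within one timebase tick */
    }
    /* timer_interrupt() then calls perf_event_do_pending() inside
     * irq_enter()/irq_exit(), where it is safe. */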

    Signed-off-by: Paul Mackerras
    Cc: stable@kernel.org
    Signed-off-by: Benjamin Herrenschmidt

    Paul Mackerras
     

01 Mar, 2010

1 commit

  • SRR1 stores more information than just the MSR value. It also stores
    valuable information about the type of interrupt we received, for
    example whether the storage interrupt we just got was because of a
    missing htab entry or not.

    We use that information to speed up the exit path.

    Now if we get preempted before we can interpret the shadow_msr values,
    we get into vcpu_put which then calls the MSR handler, which then sets
    all the SRR1 information bits in shadow_msr to 0. Great.

    So let's preserve the SRR1 specific bits in shadow_msr whenever we set
    the MSR. They don't hurt.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf