28 Apr, 2014

2 commits

  • When the guest cedes the vcpu, or the vcpu has no guest to run,
    it naps. Clear the runlatch bit of the vcpu before napping to
    indicate an idle CPU.

    Signed-off-by: Preeti U Murthy
    Acked-by: Paul Mackerras
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt

    Preeti U Murthy
     
  • The secondary threads in the core are kept offline before launching guests
    in KVM on powerpc; see commit 371fefd6f2dc4666 ("KVM: PPC: Allow book3s_hv
    guests to use SMT processor modes").

    Hence their runlatch bits are cleared. When the secondary threads are called
    in to start a guest, their runlatch bits need to be set to indicate that they
    are busy. The primary thread already has its runlatch bit set, but there is
    no harm in setting it again. Hence set the runlatch bit for all threads
    before they start a guest.

    Signed-off-by: Preeti U Murthy
    Acked-by: Paul Mackerras
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt

    Preeti U Murthy
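
    Both runlatch changes reduce to toggling a single bit in the CTRL SPR
    (the kernel wraps this as ppc64_runlatch_on()/ppc64_runlatch_off()).
    A portable sketch of the intent; the bit position is illustrative:

```c
#include <stdint.h>

/* CTRL[RUN] is the runlatch bit; the position used here is illustrative. */
#define CTRL_RUNLATCH (1u << 0)

/* Set the runlatch before a thread starts running guest code. */
uint32_t runlatch_on(uint32_t ctrl)  { return ctrl | CTRL_RUNLATCH; }

/* Clear the runlatch before napping, to advertise an idle thread. */
uint32_t runlatch_off(uint32_t ctrl) { return ctrl & ~CTRL_RUNLATCH; }
```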
     

03 Apr, 2014

2 commits

  • Pull kvm updates from Paolo Bonzini:
    "PPC and ARM do not have much going on this time. Most of the cool
    stuff, instead, is in s390 and (after a few releases) x86.

    ARM has some caching fixes and PPC has transactional memory support in
    guests. MIPS has some fixes, with more probably coming in 3.16 as
    QEMU will soon get support for MIPS KVM.

    For x86 there are optimizations for debug registers, which trigger on
    some Windows games, and other important fixes for Windows guests. We
    now expose to the guest Broadwell instruction set extensions and also
    Intel MPX. There's also a fix/workaround for OS X guests, nested
    virtualization features (preemption timer), and a couple of kvmclock
    refinements.

    For s390, the main news is asynchronous page faults, together with
    improvements to IRQs (floating irqs and adapter irqs) that speed up
    virtio devices"

    * tag 'kvm-3.15-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (96 commits)
    KVM: PPC: Book3S HV: Save/restore host PMU registers that are new in POWER8
    KVM: PPC: Book3S HV: Fix decrementer timeouts with non-zero TB offset
    KVM: PPC: Book3S HV: Don't use kvm_memslots() in real mode
    KVM: PPC: Book3S HV: Return ENODEV error rather than EIO
    KVM: PPC: Book3S: Trim top 4 bits of physical address in RTAS code
    KVM: PPC: Book3S HV: Add get/set_one_reg for new TM state
    KVM: PPC: Book3S HV: Add transactional memory support
    KVM: Specify byte order for KVM_EXIT_MMIO
    KVM: vmx: fix MPX detection
    KVM: PPC: Book3S HV: Fix KVM hang with CONFIG_KVM_XICS=n
    KVM: PPC: Book3S: Introduce hypervisor call H_GET_TCE
    KVM: PPC: Book3S HV: Fix incorrect userspace exit on ioeventfd write
    KVM: s390: clear local interrupts at cpu initial reset
    KVM: s390: Fix possible memory leak in SIGP functions
    KVM: s390: fix calculation of idle_mask array size
    KVM: s390: randomize sca address
    KVM: ioapic: reinject pending interrupts on KVM_SET_IRQCHIP
    KVM: Bump KVM_MAX_IRQ_ROUTES for s390
    KVM: s390: irq routing for adapter interrupts.
    KVM: s390: adapter interrupt sources
    ...

    Linus Torvalds
     
  • Pull main powerpc updates from Ben Herrenschmidt:
    "This time around, the powerpc merges are going to be a little bit more
    complicated than usual.

    This is the main pull request with most of the work for this merge
    window. I will describe it a bit more further down.

    There is some additional cpuidle driver work, however I haven't
    included it in this tree as it depends on some work in tip/timer-core
    which Thomas accidentally forgot to put in a topic branch. Since I
    didn't want to carry all of that tip timer stuff in powerpc-next, I
    set up a separate branch on top of Thomas' tree with just that cpuidle
    driver in it, and Stephen has been carrying that in next separately
    for a while now. I'll send a separate pull request for it.

    Additionally, two new pieces in this tree add users for a sysfs API
    that Tejun and Greg have been deprecating in drivers-core-next.
    Thankfully Greg reverted the patch that removes the old API so this
    merge can happen cleanly, but once merged, I will send a patch
    adjusting our new code to the new API so that Greg can send you the
    removal patch.

    Now as for the content of this branch, we have a lot of perf work for
    POWER8's new counters, including support for our new "nest" counters
    (also called 24x7) under pHyp (not natively yet).

    We have new functionality when running under the OPAL firmware
    (non-virtualized or KVM host), such as access to the firmware error
    logs and service processor dumps, system parameters and sensors, along
    with a hwmon driver for the latter.

    There's also a bunch of bug fixes across the board, some LE fixes,
    and a nice set of selftests for validating our various types of copy
    loops.

    On the Freescale side, we see mostly new chip/board revisions, some
    clock updates, better support for machine checks and debug exceptions,
    etc..."

    * 'next' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc: (70 commits)
    powerpc/book3s: Fix CFAR clobbering issue in machine check handler.
    powerpc/compat: 32-bit little endian machine name is ppcle, not ppc
    powerpc/le: Big endian arguments for ppc_rtas()
    powerpc: Use default set of netfilter modules (CONFIG_NETFILTER_ADVANCED=n)
    powerpc/defconfigs: Enable THP in pseries defconfig
    powerpc/mm: Make sure a local_irq_disable prevent a parallel THP split
    powerpc: Rate-limit users spamming kernel log buffer
    powerpc/perf: Fix handling of L3 events with bank == 1
    powerpc/perf/hv_{gpci, 24x7}: Add documentation of device attributes
    powerpc/perf: Add kconfig option for hypervisor provided counters
    powerpc/perf: Add support for the hv 24x7 interface
    powerpc/perf: Add support for the hv gpci (get performance counter info) interface
    powerpc/perf: Add macros for defining event fields & formats
    powerpc/perf: Add a shared interface to get gpci version and capabilities
    powerpc/perf: Add 24x7 interface headers
    powerpc/perf: Add hv_gpci interface header
    powerpc: Add hvcalls for 24x7 and gpci (Get Performance Counter Info)
    sysfs: create bin_attributes under the requested group
    powerpc/perf: Enable BHRB access for EBB events
    powerpc/perf: Add BHRB constraint and IFM MMCRA handling for EBB
    ...

    Linus Torvalds
     

29 Mar, 2014

3 commits

  • Currently we save the host PMU configuration, counter values, etc.,
    when entering a guest, and restore it on return from the guest.
    (We have to do this because the guest has control of the PMU while
    it is executing.) However, we missed saving/restoring the SIAR and
    SDAR registers, as well as the registers which are new on POWER8,
    namely SIER and MMCR2.

    This adds code to save the values of these registers when entering
    the guest and restore them on exit. This also works around the bug
    in POWER8 where setting PMAE with a counter already negative doesn't
    generate an interrupt.

    Signed-off-by: Paul Mackerras
    Acked-by: Scott Wood

    Paul Mackerras
     
  • Commit c7699822bc21 ("KVM: PPC: Book3S HV: Make physical thread 0 do
    the MMU switching") reordered the guest entry/exit code so that most
    of the guest register save/restore code happened in guest MMU context.
    A side effect of that is that the timebase still contains the guest
    timebase value at the point where we compute and use vcpu->arch.dec_expires,
    and therefore that is now a guest timebase value rather than a host
    timebase value. That in turn means that the timeouts computed in
    kvmppc_set_timer() are wrong if the timebase offset for the guest is
    non-zero. The consequence of that is things such as "sleep 1" in a
    guest after migration may sleep for much longer than they should.

    This fixes the problem by converting between guest and host timebase
    values as necessary, by adding or subtracting the timebase offset.
    This also fixes an incorrect comment.

    Signed-off-by: Paul Mackerras
    Acked-by: Scott Wood

    Paul Mackerras
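
    The fix boils down to shifting timebase values between the two domains
    by the guest's timebase offset before using them for host-side timer
    arithmetic. A minimal sketch of that conversion (function names
    hypothetical):

```c
#include <stdint.h>

/* Guest timebase = host timebase + tb_offset. Converting a guest-domain
 * expiry (such as vcpu->arch.dec_expires) into the host domain therefore
 * means subtracting the offset, and vice versa. */
uint64_t guest_tb_to_host(uint64_t guest_tb, uint64_t tb_offset)
{
    return guest_tb - tb_offset;
}

uint64_t host_tb_to_guest(uint64_t host_tb, uint64_t tb_offset)
{
    return host_tb + tb_offset;
}
```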
     
  • This adds saving of the transactional memory (TM) checkpointed state
    on guest entry and exit. We only do this if we see that the guest has
    an active transaction.

    It also adds emulation of the TM state changes when delivering IRQs
    into the guest. According to the architecture, if we are
    transactional when an IRQ occurs, the TM state is changed to
    suspended, otherwise it's left unchanged.

    Signed-off-by: Michael Neuling
    Signed-off-by: Paul Mackerras
    Acked-by: Scott Wood

    Michael Neuling
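
    The IRQ-delivery emulation described above follows the architected
    two-bit MSR[TS] encoding (00 non-transactional, 01 suspended,
    10 transactional). A sketch of the state change, with the field
    position treated as illustrative:

```c
#include <stdint.h>

/* MSR TS field encoding per the TM architecture:
 * 00 = non-transactional, 01 = suspended, 10 = transactional.
 * The shift used here is illustrative. */
#define MSR_TS_SHIFT 33
#define MSR_TS_MASK  (3ull << MSR_TS_SHIFT)
#define MSR_TS_SUSP  (1ull << MSR_TS_SHIFT)
#define MSR_TS_TRANS (2ull << MSR_TS_SHIFT)

/* On interrupt delivery, an active transaction moves to suspended;
 * any other TS state is left unchanged. */
uint64_t msr_on_interrupt(uint64_t msr)
{
    if ((msr & MSR_TS_MASK) == MSR_TS_TRANS)
        return (msr & ~MSR_TS_MASK) | MSR_TS_SUSP;
    return msr;
}
```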
     

26 Mar, 2014

1 commit

  • This introduces the H_GET_TCE hypervisor call, which is basically the
    reverse of H_PUT_TCE, as defined in the Power Architecture Platform
    Requirements (PAPR).

    The hcall H_GET_TCE is required by the kdump kernel, which uses it to
    retrieve TCEs set up by the previous (panicked) kernel.

    Signed-off-by: Laurent Dufour
    Signed-off-by: Alexander Graf
    Signed-off-by: Paul Mackerras

    Laurent Dufour
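
    As a rough model of the hcall pair: H_GET_TCE simply reads back the
    entry that H_PUT_TCE stored at a given I/O bus page. The in-memory
    table, size, and return codes below are illustrative, not the kernel's
    real-mode implementation:

```c
#include <stdint.h>
#include <stddef.h>

#define TCE_ENTRIES 64
enum { H_SUCCESS = 0, H_PARAMETER = -4 };

/* Store a translation entry for one I/O bus page. */
long h_put_tce(uint64_t *table, size_t ioba_page, uint64_t tce)
{
    if (ioba_page >= TCE_ENTRIES)
        return H_PARAMETER;
    table[ioba_page] = tce;
    return H_SUCCESS;
}

/* Read an entry back -- what a kdump kernel needs to recover the
 * mappings left behind by the panicked kernel. */
long h_get_tce(const uint64_t *table, size_t ioba_page, uint64_t *tce_out)
{
    if (ioba_page >= TCE_ENTRIES)
        return H_PARAMETER;
    *tce_out = table[ioba_page];
    return H_SUCCESS;
}

/* kdump-style round trip: read back what was previously stored. */
uint64_t tce_roundtrip(void)
{
    uint64_t table[TCE_ENTRIES] = { 0 };
    uint64_t v = 0;
    h_put_tce(table, 5, 0x1000);
    h_get_tce(table, 5, &v);
    return v;
}
```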
     

20 Mar, 2014

1 commit

  • Previously SPRG3 was marked for use by both VDSO and critical
    interrupts (though critical interrupts were not fully implemented).

    In commit 8b64a9dfb091f1eca8b7e58da82f1e7d1d5fe0ad ("powerpc/booke64:
    Use SPRG0/3 scratch for bolted TLB miss & crit int"), Mihai Caraman
    made an attempt to resolve this conflict by restoring the VDSO value
    early in the critical interrupt, but this has some issues:

    - It's incompatible with EXCEPTION_COMMON which restores r13 from the
    by-then-overwritten scratch (this cost me some debugging time).
    - It forces critical exceptions to be a special case handled
    differently from even machine check and debug level exceptions.
    - It didn't occur to me that it was possible to make this work at all
    (by doing a final "ld r13, PACA_EXCRIT+EX_R13(r13)") until after
    I made (most of) this patch. :-)

    It might be worth investigating using a load rather than SPRG on return
    from all exceptions (except TLB misses where the scratch never leaves
    the SPRG) -- it could save a few cycles. Until then, let's stick with
    SPRG for all exceptions.

    Since we cannot use SPRG4-7 for scratch without corrupting the state of
    a KVM guest, move VDSO to SPRG7 on book3e. Since neither SPRG4-7 nor
    critical interrupts exist on book3s, SPRG3 is still used for VDSO
    there.

    Signed-off-by: Scott Wood
    Cc: Mihai Caraman
    Cc: Anton Blanchard
    Cc: Paul Mackerras
    Cc: kvm-ppc@vger.kernel.org

    Scott Wood
     

13 Mar, 2014

2 commits

  • Commit 595e4f7e697e ("KVM: PPC: Book3S HV: Use load/store_fp_state
    functions in HV guest entry/exit") changed the register usage in
    kvmppc_save_fp() and kvmppc_load_fp() but omitted changing the
    instructions that load and save VRSAVE. The result is that the
    VRSAVE value was loaded from a constant address, and saved to a
    location past the end of the vcpu struct, causing host kernel
    memory corruption and various kinds of host kernel crashes.

    This fixes the problem by using register r31, which contains the
    vcpu pointer, instead of r3 and r4.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Paolo Bonzini

    Paul Mackerras
     
  • Commit 7b490411c37f ("KVM: PPC: Book3S HV: Add new state for
    transactional memory") incorrectly added some duplicate code to the
    guest exit path because I didn't manage to clean up after a rebase
    correctly. This removes the extraneous material. The presence of
    this extraneous code causes host crashes whenever a guest is run.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Paolo Bonzini

    Paul Mackerras
     

30 Jan, 2014

1 commit


27 Jan, 2014

10 commits

  • Add new state for transactional memory (TM) to kvm_vcpu_arch. Also add
    asm-offset bits that are going to be required.

    This also moves the existing TFHAR, TFIAR and TEXASR SPRs into a
    CONFIG_PPC_TRANSACTIONAL_MEM section. This requires some code changes to
    ensure we still compile with CONFIG_PPC_TRANSACTIONAL_MEM=N. Many of the
    added #ifdefs are removed in a later patch when the bulk of the TM code is
    added.

    Signed-off-by: Michael Neuling
    Signed-off-by: Paul Mackerras
    [agraf: fix merge conflict]
    Signed-off-by: Alexander Graf

    Michael Neuling
     
  • We create a guest MSR from scratch when delivering exceptions in
    a few places. Instead of extracting LPCR[ILE] and inserting it
    into MSR_LE each time, we simply create a new variable intr_msr which
    contains the entire MSR to use. For a little-endian guest, userspace
    needs to set the ILE (interrupt little-endian) bit in the LPCR for
    each vcpu (or at least one vcpu in each virtual core).

    [paulus@samba.org - removed H_SET_MODE implementation from original
    version of the patch, and made kvmppc_set_lpcr update vcpu->arch.intr_msr.]

    Signed-off-by: Anton Blanchard
    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Anton Blanchard
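
    The idea can be sketched as computing the cached intr_msr once from the
    LPCR, rather than extracting LPCR[ILE] on every exception. The bit
    positions here are illustrative stand-ins for the real reg.h
    definitions:

```c
#include <stdint.h>

/* Bit positions illustrative only; the real LPCR_ILE/MSR_* definitions
 * live in arch/powerpc/include/asm/reg.h. */
#define LPCR_ILE (1ull << 25)
#define MSR_SF   (1ull << 63)  /* 64-bit mode */
#define MSR_LE   (1ull << 0)   /* little-endian */

/* Recompute the whole MSR used when delivering interrupts to the guest;
 * called once whenever the LPCR changes, not on every exception. */
uint64_t compute_intr_msr(uint64_t lpcr)
{
    uint64_t intr_msr = MSR_SF;
    if (lpcr & LPCR_ILE)
        intr_msr |= MSR_LE;
    return intr_msr;
}
```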
     
  • The DABRX (DABR extension) register on POWER7 processors provides finer
    control over which accesses cause a data breakpoint interrupt. It
    contains 3 bits which indicate whether to enable accesses in user,
    kernel and hypervisor modes respectively to cause data breakpoint
    interrupts, plus one bit that enables both real mode and virtual mode
    accesses to cause interrupts. Currently, KVM sets DABRX to allow
    both kernel and user accesses to cause interrupts while in the guest.

    This adds support for the guest to specify other values for DABRX.
    PAPR defines a H_SET_XDABR hcall to allow the guest to set both DABR
    and DABRX with one call. This adds a real-mode implementation of
    H_SET_XDABR, which shares most of its code with the existing H_SET_DABR
    implementation. To support this, we add a per-vcpu field to store the
    DABRX value plus code to get and set it via the ONE_REG interface.

    For Linux guests to use this new hcall, userspace needs to add
    "hcall-xdabr" to the set of strings in the /chosen/hypertas-functions
    property in the device tree. If userspace does this and then migrates
    the guest to a host where the kernel doesn't include this patch, then
    userspace will need to implement H_SET_XDABR by writing the specified
    DABR value to the DABR using the ONE_REG interface. In that case, the
    old kernel will set DABRX to DABRX_USER | DABRX_KERNEL. That should
    still work correctly, at least for Linux guests, since Linux guests
    cope with getting data breakpoint interrupts in modes that weren't
    requested by just ignoring the interrupt, and Linux guests never set
    DABRX_BTI.

    The other thing this does is to make H_SET_DABR and H_SET_XDABR work
    on POWER8, which has the DAWR and DAWRX instead of DABR/X. Guests that
    know about POWER8 should use H_SET_MODE rather than H_SET_[X]DABR, but
    guests running in POWER7 compatibility mode will still use H_SET_[X]DABR.
    For them, this adds the logic to convert DABR/X values into DAWR/X values
    on POWER8.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
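
    For reference, the DABRX mode bits described in the first paragraph can
    be modeled as a simple predicate; the bit values mirror the kernel's
    DABRX_* constants but should be treated as illustrative:

```c
#include <stdint.h>
#include <stdbool.h>

/* DABRX enable bits: which privilege states may trigger a data
 * breakpoint interrupt (values illustrative). */
#define DABRX_USER   (1u << 0)  /* problem (user) state */
#define DABRX_KERNEL (1u << 1)  /* privileged (kernel) state */
#define DABRX_HYP    (1u << 2)  /* hypervisor state */
#define DABRX_BTI    (1u << 3)  /* breakpoint translation ignore */

enum cpu_state { STATE_USER, STATE_KERNEL, STATE_HYP };

/* Should a matching data access in the given state raise an interrupt?
 * KVM's old default, DABRX_USER | DABRX_KERNEL, answers "yes" for user
 * and kernel state but not hypervisor state. */
bool dabrx_allows(uint32_t dabrx, enum cpu_state s)
{
    switch (s) {
    case STATE_USER:   return dabrx & DABRX_USER;
    case STATE_KERNEL: return dabrx & DABRX_KERNEL;
    case STATE_HYP:    return dabrx & DABRX_HYP;
    }
    return false;
}
```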
     
  • POWER8 has support for hypervisor doorbell interrupts. Though the
    kernel doesn't use them for IPIs on the powernv platform yet, it
    probably will in future, so this makes KVM cope gracefully if a
    hypervisor doorbell interrupt arrives while in a guest.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • * SRR1 wake reason field for system reset interrupt on wakeup from nap
    is now a 4-bit field on P8, compared to 3 bits on P7.

    * Set PECEDP in LPCR when napping because of H_CEDE so guest doorbells
    will wake us up.

    * Waking up from nap because of a guest doorbell interrupt is not a
    reason to exit the guest.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
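
    The P7-vs-P8 difference in the first point is just a field-width change
    when decoding SRR1. A sketch with an assumed shift (the kernel's actual
    masks are SRR1_WAKEMASK and its wider POWER8 variant):

```c
#include <stdint.h>

/* Wake-reason field in SRR1 after waking from nap: 3 bits on POWER7,
 * widened to 4 bits on POWER8. The shift is illustrative. */
#define WAKE_SHIFT 19

uint64_t srr1_wake_reason_p7(uint64_t srr1)
{
    return (srr1 >> WAKE_SHIFT) & 0x7;   /* 3-bit field */
}

uint64_t srr1_wake_reason_p8(uint64_t srr1)
{
    return (srr1 >> WAKE_SHIFT) & 0xf;   /* 4-bit field */
}
```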
     
  • Currently in book3s_hv_rmhandlers.S we have three places where we
    have woken up from nap mode and we check the reason field in SRR1
    to see what event woke us up. This consolidates them into a new
    function, kvmppc_check_wake_reason. It looks at the wake reason
    field in SRR1, and if it indicates that an external interrupt caused
    the wakeup, calls kvmppc_read_intr to check what sort of interrupt
    it was.

    This also consolidates the two places where we synthesize an external
    interrupt (0x500 vector) for the guest. Now, if the guest exit code
    finds that there was an external interrupt which has been handled
    (i.e. it was an IPI indicating that there is now an interrupt pending
    for the guest), it jumps to deliver_guest_interrupt, which is in the
    last part of the guest entry code, where we synthesize guest external
    and decrementer interrupts. That code has been streamlined a little
    and now clears LPCR[MER] when appropriate as well as setting it.

    The extra clearing of any pending IPI on a secondary, offline CPU
    thread before going back to nap mode has been removed. It is no longer
    necessary now that we have code to read and acknowledge IPIs in the
    guest exit path.

    This fixes a minor bug in the H_CEDE real-mode handling - previously,
    if we found that other threads were already exiting the guest when we
    were about to go to nap mode, we would branch to the cede wakeup path
    and end up looking in SRR1 for a wakeup reason. Now we branch to a
    point after we have checked the wakeup reason.

    This also fixes a minor bug in kvmppc_read_intr - previously it could
    return 0xff rather than 1, in the case where we find that a host IPI
    is pending after we have cleared the IPI. Now it returns 1.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • POWER8 has 512 sets in the TLB, compared to 128 for POWER7, so we need
    to do more tlbiel instructions when flushing the TLB on POWER8.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
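
    The change is essentially the iteration count of the flush loop. A
    sketch of the loop structure only, since tlbiel itself is a privileged
    instruction:

```c
/* Flushing the whole TLB via tlbiel means iterating every congruence
 * class (set): 128 sets on POWER7 vs 512 on POWER8. */
#define TLB_SETS_POWER7 128
#define TLB_SETS_POWER8 512

unsigned int tlb_flush_iterations(int is_power8)
{
    unsigned int sets = is_power8 ? TLB_SETS_POWER8 : TLB_SETS_POWER7;
    unsigned int done = 0;
    for (unsigned int set = 0; set < sets; set++) {
        /* tlbiel rb, where rb encodes the set index (omitted here) */
        done++;
    }
    return done;
}
```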
     
  • This adds fields to the struct kvm_vcpu_arch to store the new
    guest-accessible SPRs on POWER8, adds code to the get/set_one_reg
    functions to allow userspace to access this state, and adds code to
    the guest entry and exit to context-switch these SPRs between host
    and guest.

    Note that DPDES (Directed Privileged Doorbell Exception State) is
    shared between threads on a core; hence we store it in struct
    kvmppc_vcore and have the master thread save and restore it.

    Signed-off-by: Michael Neuling
    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Michael Neuling
     
  • On a threaded processor such as POWER7, we group VCPUs into virtual
    cores and arrange that the VCPUs in a virtual core run on the same
    physical core. Currently we don't enforce any correspondence between
    virtual thread numbers within a virtual core and physical thread
    numbers. Physical threads are allocated starting at 0 on a first-come
    first-served basis to runnable virtual threads (VCPUs).

    POWER8 implements a new "msgsndp" instruction which guest kernels can
    use to interrupt other threads in the same core or sub-core. Since
    the instruction takes the destination physical thread ID as a parameter,
    it becomes necessary to align the physical thread IDs with the virtual
    thread IDs, that is, to make sure virtual thread N within a virtual
    core always runs on physical thread N.

    This means that it's possible that thread 0, which is where we call
    __kvmppc_vcore_entry, may end up running some other vcpu than the
    one whose task called kvmppc_run_core(), or it may end up running
    no vcpu at all, if for example thread 0 of the virtual core is
    currently executing in userspace. However, we do need thread 0
    to be responsible for switching the MMU -- a previous version of
    this patch that had other threads switching the MMU was found to
    be responsible for occasional memory corruption and machine check
    interrupts in the guest on POWER7 machines.

    To accommodate this, we no longer pass the vcpu pointer to
    __kvmppc_vcore_entry, but instead let the assembly code load it from
    the PACA. Since the assembly code will need to know the kvm pointer
    and the thread ID for threads which don't have a vcpu, we move the
    thread ID into the PACA and we add a kvm pointer to the virtual core
    structure.

    In the case where thread 0 has no vcpu to run, it still calls into
    kvmppc_hv_entry in order to do the MMU switch, and then naps until
    either its vcpu is ready to run in the guest, or some other thread
    needs to exit the guest. In the latter case, thread 0 jumps to the
    code that switches the MMU back to the host. This control flow means
    that now we switch the MMU before loading any guest vcpu state.
    Similarly, on guest exit we now save all the guest vcpu state before
    switching the MMU back to the host. This has required substantial
    code movement, making the diff rather large.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • POWER8 doesn't have the DABR and DABRX registers; instead it has
    new DAWR/DAWRX registers, which will be handled in a later patch.

    Signed-off-by: Michael Neuling
    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Michael Neuling
     

09 Jan, 2014

2 commits


18 Dec, 2013

1 commit

  • We don't use PACATOC for PR. Avoid updating HOST_R2 in PR KVM mode
    when both HV and PR are enabled in the kernel. Without this we get the
    crash below:

    (qemu)
    Unable to handle kernel paging request for data at address 0xffffffffffff8310
    Faulting instruction address: 0xc00000000001d5a4
    cpu 0x2: Vector: 300 (Data Access) at [c0000001dc53aef0]
    pc: c00000000001d5a4: .vtime_delta.isra.1+0x34/0x1d0
    lr: c00000000001d760: .vtime_account_system+0x20/0x60
    sp: c0000001dc53b170
    msr: 8000000000009032
    dar: ffffffffffff8310
    dsisr: 40000000
    current = 0xc0000001d76c62d0
    paca = 0xc00000000fef1100 softe: 0 irq_happened: 0x01
    pid = 4472, comm = qemu-system-ppc
    enter ? for help
    [c0000001dc53b200] c00000000001d760 .vtime_account_system+0x20/0x60
    [c0000001dc53b290] c00000000008d050 .kvmppc_handle_exit_pr+0x60/0xa50
    [c0000001dc53b340] c00000000008f51c kvm_start_lightweight+0xb4/0xc4
    [c0000001dc53b510] c00000000008cdf0 .kvmppc_vcpu_run_pr+0x150/0x2e0
    [c0000001dc53b9e0] c00000000008341c .kvmppc_vcpu_run+0x2c/0x40
    [c0000001dc53ba50] c000000000080af4 .kvm_arch_vcpu_ioctl_run+0x54/0x1b0
    [c0000001dc53bae0] c00000000007b4c8 .kvm_vcpu_ioctl+0x478/0x730
    [c0000001dc53bca0] c0000000002140cc .do_vfs_ioctl+0x4ac/0x770
    [c0000001dc53bd80] c0000000002143e8 .SyS_ioctl+0x58/0xb0
    [c0000001dc53be30] c000000000009e58 syscall_exit+0x0/0x98

    Signed-off-by: Alexander Graf

    Aneesh Kumar K.V
     

21 Nov, 2013

1 commit

  • In some scenarios, e.g. OpenStack CI, a PR guest can trigger "sc 1"
    frequently. This patch optimizes the path by directly delivering
    BOOK3S_INTERRUPT_SYSCALL to the HV guest, so powernv can return to the
    HV guest without a heavy exit, i.e. with no need to swap the TLB,
    HTAB, etc.

    Signed-off-by: Liu Ping Fan
    Signed-off-by: Alexander Graf

    Liu Ping Fan
     

19 Nov, 2013

1 commit

  • Some users have reported instances of the host hanging with secondary
    threads of a core waiting for the primary thread to exit the guest,
    and the primary thread stuck in nap mode. This prompted a review of
    the memory barriers in the guest entry/exit code, and this is the
    result. Most of these changes are the suggestions of Dean Burdick.

    The barriers between updating napping_threads and reading the
    entry_exit_count on the one hand, and updating entry_exit_count and
    reading napping_threads on the other, need to be isync not lwsync,
    since we need to ensure that either the napping_threads update or the
    entry_exit_count update get seen. It is not sufficient to order the
    load vs. lwarx, as lwsync does; we need to order the load vs. the
    stwcx., so we need isync.

    In addition, we need a full sync before sending IPIs to wake other
    threads from nap, to ensure that the write to the entry_exit_count is
    visible before the IPI occurs.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
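
    As a portable analog of the pattern, C11 sequentially consistent RMW
    operations provide the store-then-load ordering that lwsync cannot; on
    real POWER hardware the fix uses stwcx.; isync and a full sync before
    the wakeup IPI. Illustrative only:

```c
#include <stdatomic.h>

/* One thread sets its napping bit and then checks whether the core is
 * already exiting; the exiting thread bumps entry_exit_count and then
 * checks who is napping. Each side must observe at least one of the
 * other side's updates, which requires full ordering between its own
 * store and the subsequent load. */
atomic_uint napping_threads;
atomic_uint entry_exit_count;

unsigned int nap_path(unsigned int my_bit)
{
    atomic_fetch_or(&napping_threads, my_bit);  /* seq_cst RMW */
    /* full barrier implied; on POWER: stwcx. ; isync */
    return atomic_load(&entry_exit_count);      /* then check for exits */
}

unsigned int exit_path(void)
{
    atomic_fetch_add(&entry_exit_count, 0x100); /* one more exiter */
    /* on POWER: a full sync is also needed before the wakeup IPI */
    return atomic_load(&napping_threads);       /* whom to wake */
}
```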
     

04 Nov, 2013

1 commit


17 Oct, 2013

11 commits

  • With this patch, if HV is included, interrupts come in to the HV version
    of the kvmppc_interrupt code, which then jumps to the PR handler,
    renamed to kvmppc_interrupt_pr, if the guest is a PR guest. This helps
    in enabling both HV and PR, which we do in a later patch.

    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Alexander Graf

    Aneesh Kumar K.V
     
  • When an interrupt or exception happens in the guest that comes to the
    host, the CPU goes to hypervisor real mode (MMU off) to handle the
    exception but doesn't change the MMU context. After saving a few
    registers, we then clear the "in guest" flag. If, for any reason,
    we get an exception in the real-mode code, that then gets handled
    by the normal kernel exception handlers, which turn the MMU on. This
    is disastrous if the MMU is still set to the guest context, since we
    end up executing instructions from random places in the guest kernel
    with hypervisor privilege.

    In order to catch this situation, we define a new value for the "in guest"
    flag, KVM_GUEST_MODE_HOST_HV, to indicate that we are in hypervisor real
    mode with guest MMU context. If the "in guest" flag is set to this value,
    we branch off to an emergency handler. For the moment, this just does
    a branch to self to stop the CPU from doing anything further.

    While we're here, we define another new flag value to indicate that we
    are in a HV guest, as distinct from a PR guest. This will be useful
    when we have a kernel that can support both PR and HV guests concurrently.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • Both PR and HV KVM have separate, identical copies of the
    kvmppc_skip_interrupt and kvmppc_skip_Hinterrupt handlers that are
    used for the situation where an interrupt happens when loading the
    instruction that caused an exit from the guest. To eliminate this
    duplication and make it easier to compile in both PR and HV KVM,
    this moves this code to arch/powerpc/kernel/exceptions-64s.S along
    with other kernel interrupt handler code.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This enables us to use the Processor Compatibility Register (PCR) on
    POWER7 to put the processor into architecture 2.05 compatibility mode
    when running a guest. In this mode the new instructions and registers
    that were introduced on POWER7 are disabled in user mode. This
    includes all the VSX facilities plus several other instructions such
    as ldbrx, stdbrx, popcntw, popcntd, etc.

    To select this mode, we have a new register accessible through the
    set/get_one_reg interface, called KVM_REG_PPC_ARCH_COMPAT. Setting
    this to zero gives the full set of capabilities of the processor.
    Setting it to one of the "logical" PVR values defined in PAPR puts
    the vcpu into the compatibility mode for the corresponding
    architecture level. The supported values are:

    0x0f000002 Architecture 2.05 (POWER6)
    0x0f000003 Architecture 2.06 (POWER7)
    0x0f100003 Architecture 2.06+ (POWER7+)

    Since the PCR is per-core, the architecture compatibility level and
    the corresponding PCR value are stored in the struct kvmppc_vcore, and
    are therefore shared between all vcpus in a virtual core.

    Signed-off-by: Paul Mackerras
    [agraf: squash in fix to add missing break statements and documentation]
    Signed-off-by: Alexander Graf

    Paul Mackerras
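
    The supported logical PVR values can be sketched as a simple lookup;
    the return codes below are illustrative, not the real PCR encoding:

```c
#include <stdint.h>

/* Logical PVR values from PAPR accepted via KVM_REG_PPC_ARCH_COMPAT;
 * zero means "full native capability". */
#define PVR_ARCH_205  0x0f000002u  /* Architecture 2.05 (POWER6)  */
#define PVR_ARCH_206  0x0f000003u  /* Architecture 2.06 (POWER7)  */
#define PVR_ARCH_206P 0x0f100003u  /* Architecture 2.06+ (POWER7+) */

int arch_compat_level(uint32_t logical_pvr)
{
    switch (logical_pvr) {
    case 0:             return 0;    /* native: full capabilities */
    case PVR_ARCH_205:  return 205;
    case PVR_ARCH_206:  return 206;
    case PVR_ARCH_206P: return 207;  /* "2.06+" */
    default:            return -1;   /* unsupported value */
    }
}
```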
     
  • POWER7 and later IBM server processors have a register called the
    Program Priority Register (PPR), which controls the priority of
    each hardware CPU SMT thread, and affects how fast it runs compared
    to other SMT threads. This priority can be controlled by writing to
    the PPR or by use of a set of instructions of the form or rN,rN,rN
    which are otherwise no-ops but have been defined to set the priority
    to particular levels.

    This adds code to context switch the PPR when entering and exiting
    guests and to make the PPR value accessible through the SET/GET_ONE_REG
    interface. When entering the guest, we set the PPR as late as
    possible, because if we are setting a low thread priority it will
    make the code run slowly from that point on. Similarly, the
    first-level interrupt handlers save the PPR value in the PACA very
    early on, and set the thread priority to the medium level, so that
    the interrupt handling code runs at a reasonable speed.

    Acked-by: Benjamin Herrenschmidt
    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This adds the ability to have a separate LPCR (Logical Partitioning
    Control Register) value relating to a guest for each virtual core,
    rather than only having a single value for the whole VM. This
    corresponds to what real POWER hardware does, where there is a LPCR
    per CPU thread but most of the fields are required to have the same
    value on all active threads in a core.

    The per-virtual-core LPCR can be read and written using the
    GET/SET_ONE_REG interface. Userspace can only modify the
    following fields of the LPCR value:

    DPFD Default prefetch depth
    ILE Interrupt little-endian
    TC Translation control (secondary HPT hash group search disable)

    We still maintain a per-VM default LPCR value in kvm->arch.lpcr, which
    contains bits relating to memory management, i.e. the Virtualized
    Partition Memory (VPM) bits and the bits relating to guest real mode.
    When this default value is updated, the update needs to be propagated
    to the per-vcore values, so we add a kvmppc_update_lpcr() helper to do
    that.

    Signed-off-by: Paul Mackerras
    [agraf: fix whitespace]
    Signed-off-by: Alexander Graf

    Paul Mackerras
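
    The userspace-writable subset can be modeled as a masked merge, in the
    spirit of the kvmppc_update_lpcr() helper; bit positions illustrative:

```c
#include <stdint.h>

/* Only these LPCR fields may be changed from userspace; the bit values
 * are illustrative (see reg.h for the real LPCR_DPFD/ILE/TC masks). */
#define LPCR_DPFD (7ull << 52)
#define LPCR_ILE  (1ull << 25)
#define LPCR_TC   (1ull << 8)
#define LPCR_USER_MASK (LPCR_DPFD | LPCR_ILE | LPCR_TC)

/* Fold new bits into the per-vcore value while leaving everything
 * outside the mask untouched. */
uint64_t update_lpcr(uint64_t vcore_lpcr, uint64_t new_lpcr, uint64_t mask)
{
    return (vcore_lpcr & ~mask) | (new_lpcr & mask);
}
```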
     
  • The yield count in the VPA is supposed to be incremented every time
    we enter the guest, and every time we exit the guest, so that its
    value is even when the vcpu is running in the guest and odd when it
    isn't. However, it's currently possible that we increment the yield
    count on the way into the guest but then find that other CPU threads
    are already exiting the guest, so we go back to nap mode via the
    secondary_too_late label. In this situation we don't increment the
    yield count again, breaking the relationship between the LSB of the
    count and whether the vcpu is in the guest.

    To fix this, we move the increment of the yield count to a point
    after we have checked whether other CPU threads are exiting.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
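
    The parity invariant and the fix can be sketched as follows: entry and
    exit each flip the LSB of the yield count, and an aborted entry must
    not touch it (function names hypothetical):

```c
#include <stdint.h>

/* The VPA yield count is bumped once on every guest entry and once on
 * every exit, so its LSB tells whether the vcpu is in the guest. The
 * fix moves the entry-side increment after the check for other threads
 * already exiting, so an aborted entry leaves the parity intact. */
uint32_t yield_count_enter(uint32_t yc, int other_threads_exiting)
{
    if (other_threads_exiting)
        return yc;        /* aborted entry: parity must not change */
    return yc + 1;        /* parity flips: vcpu is now in the guest */
}

uint32_t yield_count_exit(uint32_t yc)
{
    return yc + 1;        /* parity flips back: vcpu is out */
}
```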
     
  • This moves the code in book3s_hv_rmhandlers.S that reads any pending
    interrupt from the XICS interrupt controller, and works out whether
    it is an IPI for the guest, an IPI for the host, or a device interrupt,
    into a new function called kvmppc_read_intr. Later patches will
    need this.
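    The classification can be sketched like this; XICS_IPI matches the
    kernel's IPI source number, while the enum and helper name are
    illustrative:

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define XICS_IPI 2        /* interrupt source number used for IPIs */

    enum intr_kind { NO_INTR, HOST_IPI, GUEST_IPI, DEVICE_INTR };

    /* Sketch of the decision kvmppc_read_intr makes on the XIRR value
     * read from the XICS presentation layer. */
    static enum intr_kind classify_xirr(uint32_t xirr, int host_ipi_pending)
    {
        uint32_t src = xirr & 0x00ffffff; /* low 24 bits: source (XISR) */

        if (src == 0)                     /* 0 means "no interrupt"     */
            return NO_INTR;
        if (src == XICS_IPI)              /* IPI: the host's or guest's? */
            return host_ipi_pending ? HOST_IPI : GUEST_IPI;
        return DEVICE_INTR;               /* anything else: device irq  */
    }

    int main(void)
    {
        assert(classify_xirr(0xff000000, 0) == NO_INTR);
        assert(classify_xirr(0xff000002, 1) == HOST_IPI);
        assert(classify_xirr(0xff000002, 0) == GUEST_IPI);
        assert(classify_xirr(0xff000123, 0) == DEVICE_INTR);
        return 0;
    }
    ```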

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • We have two paths into and out of the low-level guest entry and exit
    code: from a vcpu task via kvmppc_hv_entry_trampoline, and from the
    system reset vector for an offline secondary thread on POWER7 via
    kvm_start_guest. Currently both just branch to kvmppc_hv_entry to
    enter the guest, and on guest exit, we test the vcpu physical thread
    ID to detect which way we came in and thus whether we should return
    to the vcpu task or go back to nap mode.

    In order to make the code flow clearer, and to keep the code relating
    to each flow together, this turns kvmppc_hv_entry into a subroutine
    that follows the normal conventions for call and return. This means
    that kvmppc_hv_entry_trampoline() and kvmppc_hv_entry() now establish
    normal stack frames, and we use the normal stack slots for saving
    return addresses rather than local_paca->kvm_hstate.vmhandler. Apart
    from that this is mostly moving code around unchanged.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • This allows guests to have a different timebase origin from the host.
    This is needed for migration, where a guest can migrate from one host
    to another and the two hosts might have a different timebase origin.
    However, the timebase seen by the guest must not go backwards, and
    should go forwards only by a small amount corresponding to the time
    taken for the migration.

    Therefore this provides a new per-vcpu value accessed via the one_reg
    interface using the new KVM_REG_PPC_TB_OFFSET identifier. This value
    defaults to 0 and is not modified by KVM. On entering the guest, this
    value is added onto the timebase, and on exiting the guest, it is
    subtracted from the timebase.

    This is only supported for recent POWER hardware which has the TBU40
    (timebase upper 40 bits) register. Writing to the TBU40 register only
    alters the upper 40 bits of the timebase, leaving the lower 24 bits
    unchanged. This provides a way to modify the timebase for guest
    migration without disturbing the synchronization of the timebase
    registers across CPU cores. The kernel rounds up the value given
    to a multiple of 2^24.
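    The rounding and the entry/exit arithmetic can be sketched as
    follows (helper names are illustrative):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* TBU40 writes reach only the upper 40 bits of the 64-bit timebase,
     * so a timebase offset must be a multiple of 2^24; the kernel
     * rounds the supplied value up to that granularity. */

    #define TB_OFFSET_ALIGN (1ULL << 24)   /* granularity of TBU40 writes */

    static uint64_t round_tb_offset(uint64_t offset)
    {
        /* round up to the next multiple of 2^24 */
        return (offset + TB_OFFSET_ALIGN - 1) & ~(TB_OFFSET_ALIGN - 1);
    }

    /* Offset added on guest entry; subtracted again on guest exit. */
    static uint64_t guest_tb(uint64_t host_tb, uint64_t offset)
    {
        return host_tb + offset;
    }

    int main(void)
    {
        assert(round_tb_offset(0) == 0);
        assert(round_tb_offset(1) == TB_OFFSET_ALIGN);
        assert(round_tb_offset(TB_OFFSET_ALIGN) == TB_OFFSET_ALIGN);
        assert(round_tb_offset(TB_OFFSET_ALIGN + 1) == 2 * TB_OFFSET_ALIGN);

        /* With an aligned offset the low 24 bits of the timebase are
         * untouched, preserving core-to-core synchronization. */
        uint64_t off = round_tb_offset(12345);
        assert((guest_tb(0xABCDEFULL, off) & 0xFFFFFF) == 0xABCDEF);
        return 0;
    }
    ```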

    Timebase values stored in KVM structures (struct kvm_vcpu, struct
    kvmppc_vcore, etc.) are stored as host timebase values. The timebase
    values in the dispatch trace log need to be guest timebase values,
    however, since that is read directly by the guest. This moves the
    setting of vcpu->arch.dec_expires on guest exit to a point after we
    have restored the host timebase so that vcpu->arch.dec_expires is a
    host timebase value.

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     
  • Currently we are not saving and restoring the SIAR and SDAR registers in
    the PMU (performance monitor unit) on guest entry and exit. The result
    is that performance monitoring tools in the guest could get false
    information about where a program was executing and what data it was
    accessing at the time of a performance monitor interrupt. This fixes
    it by saving and restoring these registers along with the other PMU
    registers on guest entry/exit.

    This also provides a way for userspace to access these values for a
    vcpu via the one_reg interface.
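    A minimal sketch of the save/restore round trip, assuming an
    illustrative struct layout (not the kernel's):

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* SIAR/SDAR join the PMU state saved on guest exit and restored on
     * guest entry, so profilers in the guest see the guest's own
     * sampled addresses rather than the host's. */

    struct pmu_regs { uint64_t siar, sdar; };

    static void save_pmu(struct pmu_regs *dst, const struct pmu_regs *hw)
    {
        dst->siar = hw->siar;   /* sampled instruction address */
        dst->sdar = hw->sdar;   /* sampled data address        */
    }

    static void restore_pmu(struct pmu_regs *hw, const struct pmu_regs *src)
    {
        hw->siar = src->siar;
        hw->sdar = src->sdar;
    }

    int main(void)
    {
        struct pmu_regs hw = { 0x1000, 0x2000 }; /* guest values in PMU */
        struct pmu_regs vcpu_state = { 0, 0 };

        save_pmu(&vcpu_state, &hw);              /* guest exit  */
        hw.siar = 0xdead; hw.sdar = 0xbeef;      /* host runs   */
        restore_pmu(&hw, &vcpu_state);           /* guest entry */

        assert(hw.siar == 0x1000 && hw.sdar == 0x2000);
        return 0;
    }
    ```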

    Signed-off-by: Paul Mackerras
    Signed-off-by: Alexander Graf

    Paul Mackerras
     

10 Oct, 2013

1 commit

  • This fixes a typo in the code that saves the guest DSCR (Data Stream
    Control Register) into the kvm_vcpu_arch struct on guest exit. The
    effect of the typo was that the DSCR value was saved in the wrong place,
    so changes to the DSCR by the guest didn't persist across guest exit
    and entry, and some host kernel memory got corrupted.

    Cc: stable@vger.kernel.org [v3.1+]
    Signed-off-by: Paul Mackerras
    Acked-by: Alexander Graf
    Signed-off-by: Paolo Bonzini

    Paul Mackerras