17 May, 2010

40 commits

  • When CPU_UP_CANCELED, hardware_enable() has not been called at the CPU
    which is going up because raw_notifier_call_chain(CPU_ONLINE)
    has not been called for this cpu.

    Drop the handling for CPU_UP_CANCELED.

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Avi Kivity

    Lai Jiangshan
     
  • Since commit bf47a760f66ad, we no longer handle ptes with the global bit
    set specially, so there is no reason to distinguish between shadow pages
    created with cr4.pge set and clear.

    Such tracking is expensive when the guest toggles cr4.pge, so drop it.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • The RCU/SRCU APIs have already changed to support proving RCU usage.

    I got the following dmesg when PROVE_RCU=y because we used the incorrect API.
    This patch converts rcu_dereference() to srcu_dereference() or a related API.

    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    arch/x86/kvm/mmu.c:3020 invoked rcu_dereference_check() without protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 0
    2 locks held by qemu-system-x86/8550:
    #0: (&kvm->slots_lock){+.+.+.}, at: [] kvm_set_memory_region+0x29/0x50 [kvm]
    #1: (&(&kvm->mmu_lock)->rlock){+.+...}, at: [] kvm_arch_commit_memory_region+0xa6/0xe2 [kvm]

    stack backtrace:
    Pid: 8550, comm: qemu-system-x86 Not tainted 2.6.34-rc4-tip-01028-g939eab1 #27
    Call Trace:
    [] lockdep_rcu_dereference+0xaa/0xb3
    [] kvm_mmu_calculate_mmu_pages+0x44/0x7d [kvm]
    [] kvm_arch_commit_memory_region+0xb7/0xe2 [kvm]
    [] __kvm_set_memory_region+0x636/0x6e2 [kvm]
    [] kvm_set_memory_region+0x37/0x50 [kvm]
    [] vmx_set_tss_addr+0x46/0x5a [kvm_intel]
    [] kvm_arch_vm_ioctl+0x17a/0xcf8 [kvm]
    [] ? unlock_page+0x27/0x2c
    [] ? __do_fault+0x3a9/0x3e1
    [] kvm_vm_ioctl+0x364/0x38d [kvm]
    [] ? up_read+0x23/0x3d
    [] vfs_ioctl+0x32/0xa6
    [] do_vfs_ioctl+0x495/0x4db
    [] ? fget_light+0xc2/0x241
    [] ? do_sys_open+0x104/0x116
    [] ? retint_swapgs+0xe/0x13
    [] sys_ioctl+0x47/0x6a
    [] system_call_fastpath+0x16/0x1b

    Signed-off-by: Lai Jiangshan
    Signed-off-by: Avi Kivity

    Lai Jiangshan
     
  • Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Quote from Avi:

    |Just change the assignment to a 'goto restart;' please,
    |I don't like playing with list_for_each internals.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
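
The 'goto restart' pattern quoted above can be sketched in userspace C. This is a minimal illustration, not the kernel change itself: the kernel code walks a struct list_head, while the node type and the prune_negative() helper below are hypothetical names invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type; the kernel code walks struct list_head instead. */
struct node {
	int value;
	struct node *next;
};

/*
 * Unlink every node whose value is negative and return how many were
 * removed. After each removal the walk restarts from the head instead of
 * patching the loop cursor by hand.
 */
static int prune_negative(struct node **head)
{
	int removed = 0;
	struct node **link;

	assert(head != NULL);
restart:
	for (link = head; *link; link = &(*link)->next) {
		if ((*link)->value < 0) {
			*link = (*link)->next;	/* unlink: the list just changed */
			removed++;
			goto restart;		/* begin again with a fresh cursor */
		}
	}
	return removed;
}
```

Restarting costs extra passes over the list, but the loop never depends on how the iterator is implemented.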
     
  • If kvm_task_switch() fails, the code exits to userspace without specifying
    an exit reason, so the previous exit reason is reused by userspace. Fix
    this by specifying the exit reason correctly.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • 'vcpu' is unused, remove it

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • Define 'multimapped' as 'bool'.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • Remove 'struct kvm_unsync_walk' since it's not used.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • emulator_task_switch() should return -1 for failure and 0 for success to
    the caller, just like x86_emulate_insn() does.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • There is no real distinction between glevels=3 and glevels=4; both have
    exactly the same format and the code treats them exactly the same way. Drop
    role.glevels and replace it with role.cr4_pae (which is meaningful). This
    simplifies the code a bit.

    As a side effect, it allows sharing shadow page tables between pae and
    longmode guest page tables at the same guest page.

    Signed-off-by: Avi Kivity
    Signed-off-by: Marcelo Tosatti

    Avi Kivity
     
  • When a fault triggers a task switch, the error code, if existent, has to
    be pushed on the new task's stack. Implement the missing bits.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti

    Jan Kiszka
     
  • Stop the switch immediately if task_switch_16/32 returned an error. Only
    if that step succeeded should the switch actually take place and update
    any register state.

    Signed-off-by: Jan Kiszka
    Signed-off-by: Marcelo Tosatti

    Jan Kiszka
     
  • We can call kvm_mmu_pte_write() directly from
    emulator_cmpxchg_emulated() instead of passing mmu_only down to
    emulator_write_emulated_onepage() and calling it there.

    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • This patch limits the number of pages per memory slot so that we no
    longer need to take extra care about type issues.

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Marcelo Tosatti

    Takuya Yoshikawa
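
A bound check of the kind described above might look like the following sketch. The cap value, the page shift, and the names SLOT_MAX_NR_PAGES and check_slot_size() are assumptions made for illustration; the commit message does not state the actual limit. The cap is chosen here so that a page count always fits in a signed 32-bit int.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Assumed cap: the largest count that still fits in a signed 32-bit int. */
#define SLOT_MAX_NR_PAGES ((UINT64_C(1) << 31) - 1)
/* Assumed page size of 4 KiB. */
#define SLOT_PAGE_SHIFT 12

/* Hypothetical helper: reject slots whose page count exceeds the cap. */
static int check_slot_size(uint64_t memory_size)
{
	uint64_t npages = memory_size >> SLOT_PAGE_SHIFT;

	if (npages > SLOT_MAX_NR_PAGES)
		return -EINVAL;
	return 0;
}
```

With the count bounded once at slot creation, later code can hold it in a narrower integer type without a range check at every use.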
     
  • Currently both SVM and VMX have their own DR handling code. Move it to
    x86.c.

    Acked-by: Jan Kiszka
    Signed-off-by: Gleb Natapov
    Signed-off-by: Marcelo Tosatti

    Gleb Natapov
     
  • On SVM we set the instruction length of skipped instructions
    to hard-coded, well known values, which could be wrong when (bogus,
    but valid) prefixes (REX, segment override) are used.
    Newer AMD processors (Fam10h 45nm and better, aka. PhenomII or
    AthlonII) have an explicit NEXTRIP field in the VMCB containing the
    desired information.
    Since it is cheap to do so, we use this field to override the guessed
    value on newer processors.
    A fix for older CPUs would be rather expensive, as it would require
    fetching and partially decoding the instruction. As the problem is not
    a security issue and needs special, handcrafted code to trigger
    (no compiler will ever generate such code), I omit a fix for older
    CPUs.
    If someone is interested, I have both a patch for these CPUs as well as
    demo code triggering this issue: It segfaults under KVM, but runs
    perfectly on native Linux.

    Signed-off-by: Andre Przywara
    Signed-off-by: Marcelo Tosatti

    Andre Przywara
     
  • MAXPHYADDR is derived from cpuid 0x80000008, but when that isn't present, we
    get some random value.

    Fix by checking first that cpuid 0x80000008 is supported.

    Acked-by: Pekka Enberg
    Signed-off-by: Avi Kivity
    Signed-off-by: Marcelo Tosatti

    Avi Kivity
     
  • Log emulated instructions in ftrace, especially if they failed.

    Signed-off-by: Avi Kivity
    Signed-off-by: Marcelo Tosatti

    Avi Kivity
     
  • Currently if an instruction spans a page boundary, when we fetch the
    second half we overwrite the first half. This prevents us from tracing
    the full instruction opcodes.

    Fix by appending the second half to the first.

    Signed-off-by: Avi Kivity
    Signed-off-by: Marcelo Tosatti

    Avi Kivity
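
A minimal sketch of the appending behaviour, under simplifying assumptions: guest memory is modelled as a flat byte array indexed by address, pages are 4 KiB, and the names fetch_buf and fetch_byte() are hypothetical. Each refill stops at the page boundary and lands after the bytes already cached, so the first half of the instruction survives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define INSN_MAX 15	/* longest legal x86 instruction */
#define PAGE_SZ 4096u	/* assumed page size */

/* Hypothetical fetch cache covering addresses [start, end). */
struct fetch_buf {
	uint8_t data[INSN_MAX];
	uint64_t start;
	uint64_t end;
};

/* Return the byte at addr, refilling the cache one page at a time. */
static uint8_t fetch_byte(struct fetch_buf *fb, const uint8_t *mem, uint64_t addr)
{
	if (addr == fb->end) {
		size_t used = (size_t)(fb->end - fb->start);
		size_t to_page_end = PAGE_SZ - (size_t)(addr % PAGE_SZ);
		size_t room = INSN_MAX - used;
		size_t n = to_page_end < room ? to_page_end : room;

		assert(n > 0);
		/* Append after the cached bytes instead of overwriting them. */
		memcpy(fb->data + used, mem + addr, n);
		fb->end += n;
	}
	assert(addr >= fb->start && addr < fb->end);
	return fb->data[addr - fb->start];
}
```

After the whole instruction has been fetched, data[0 .. end - start) holds its opcode bytes in order, which is what a tracer would want to log.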
     
  • Commit a0abee86af2d1f048dbe99d2bcc4a2cefe685617 introduced unsetting of the
    IRQ line from userspace. This added a new core specific callback that I
    apparently forgot to add for BookE.

    So let's add the callback for BookE as well, making it build again.

    Signed-off-by: Alexander Graf
    Signed-off-by: Marcelo Tosatti

    Alexander Graf
     
  • After the is_rsvd_bits_set() checks, EFER.NXE must be enabled if the NX bit is set

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • kvm_mmu_page.oos_link is not used, so remove it

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • This patch does:
    - 'sp' parameter in inspect_spte_fn() is not used, so remove it
    - fix 'kvm' and 'slots' not being defined in count_rmaps()
    - fix a bug in inspect_spte_has_rmap()

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • Book3S knows how to convert floats to doubles and vice versa. BookE doesn't.
    So let's make sure we don't export them on BookE.

    This fixes a link error on BookE with CONFIG_KVM=y.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • BookE KVM doesn't know about QPRs, so let's not try to access them.

    This fixes a build error on BookE KVM.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Cell can't handle MSR_FE0 and MSR_FE1 too well. It gets dog slow.
    So let's just override the guest whenever we see one of the two and mask them
    out. See commit ddf5f75a16b3e7460ffee881795aa168dffcd0cf for reference.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • Bool defaults to at least byte width. We usually only want to waste a single
    bit on this. So let's move all the bool values to bitfields, potentially
    saving memory.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
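
The saving can be seen in a small sketch. The struct and field names are hypothetical and unrelated to the actual KVM structures; the exact sizes depend on the ABI, so the example only compares them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag names, chosen only for illustration. */
struct flags_plain {
	bool f0, f1, f2, f3, f4, f5, f6, f7;	/* at least one byte each */
};

struct flags_packed {
	bool f0:1, f1:1, f2:1, f3:1, f4:1, f5:1, f6:1, f7:1;	/* one bit each */
};

/* Bytes saved per instance by packing the flags, clamped at zero. */
static size_t bytes_saved(void)
{
	if (sizeof(struct flags_packed) >= sizeof(struct flags_plain))
		return 0;
	return sizeof(struct flags_plain) - sizeof(struct flags_packed);
}
```

The trade-off is that a bitfield member has no address and each write becomes a read-modify-write of the containing storage unit, so this suits flags that are neither pointed to nor updated concurrently.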
     
  • Some constants were bigger than ints. Let's mark them as such so we don't
    accidentally truncate them.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
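
One common way an unsuffixed constant causes silent truncation is sketched below. This is an illustration of the general hazard, not the specific constants the patch touched; it assumes a 32-bit int, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * ~0xfffU is computed in unsigned int, so with a 32-bit int the mask is
 * 0xfffff000 and the upper 32 bits of addr are cleared as well.
 */
static uint64_t page_base_truncating(uint64_t addr)
{
	return addr & ~0xfffU;
}

/* The ULL suffix makes the mask 64 bits wide, so only the low 12 bits clear. */
static uint64_t page_base(uint64_t addr)
{
	return addr & ~0xfffULL;
}
```

Both functions compile without warnings at default settings, which is why marking wide constants explicitly is worth the extra characters.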
     
  • Some HTAB providers (namely the PS3) ignore the SECONDARY flag. They
    just put an entry in the htab as secondary when they see fit.

    So we need to check the return value of htab_insert to remember the
    correct slot id so we can actually invalidate the entry again.

    Fixes KVM on the PS3.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • Mac OS X uses the dcba instruction. According to the specification it doesn't
    guarantee any functionality, so let's just emulate it as nop.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • On most systems we need to emulate dcbz when running 32 bit guests. So
    far we've been rather slack, not giving correct DSISR values to the guest.

    This patch makes the emulation more accurate, introducing a difference
    between "page not mapped" and "write protection fault". While at it, it
    also speeds up dcbz emulation by an order of magnitude by using kmap.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • The FPU/Altivec/VSX enablement also brought access to some structure
    elements that are only defined when the respective config options
    are enabled.

    Unfortunately I forgot to check for the config options at some places,
    so let's do that now.

    Unbreaks the build when CONFIG_VSX is not set.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • MOL uses its own hypercall interface to call back into userspace when
    the guest wants to do something.

    So let's implement that as an exit reason, specify it with a CAP and
    only really use it when userspace wants us to.

    The only user of it so far is MOL.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • Sometimes we don't want all capabilities to be available to all
    our vcpus. One example of that is the OSI interface, implemented
    in the next patch.

    In order to have a generic mechanism for enabling capabilities
    individually, this patch introduces a new ioctl that can be used
    for this purpose. That way features we don't want in all guests or
    userspace configurations can just not be enabled and we're good.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • Mac OS X has some applications - namely the Finder - that require alignment
    interrupts to work properly. So we need to implement them.

    But the spec for 970 and 750 also looks different. While 750 requires the
    DSISR and DAR fields to reflect some instruction bits (DSISR) and the fault
    address (DAR), the 970 declares this as an optional feature. So we need
    to reconstruct DSISR and DAR manually.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • We get MMIOs with the weirdest instructions. But every time we do,
    we need to improve our emulator to implement them.

    So let's do that - this time it's lbzux and lhax's round.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • We have a 32 bit value in the PACA to store XER in. We also do an stw
    when storing XER in there. But then we load it with ld, completely
    screwing it up on every entry.

    Welcome to the Big Endian world.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
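
The effect of mixing a 32-bit store with a 64-bit load on a big-endian machine can be modelled in portable C. The helpers below are hypothetical stand-ins for stw, lwz and ld, built from explicit byte accesses so the result does not depend on the host's endianness.

```c
#include <assert.h>
#include <stdint.h>

/* Model of stw: store a 32-bit value big-endian at p. */
static void store32_be(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
}

/* Model of lwz: load a 32-bit big-endian value from p. */
static uint32_t load32_be(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Model of ld: load a 64-bit big-endian value from p (reads 8 bytes). */
static uint64_t load64_be(const uint8_t *p)
{
	return ((uint64_t)load32_be(p) << 32) | load32_be(p + 4);
}
```

Loading 64 bits from a slot that was written with a 32-bit store puts the stored value in the upper half of the register and fills the lower half with whatever follows the slot in memory.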
     
  • BATs can not only be written to, you can also read them out!
    So let's implement emulation for reading BAT values again.

    While at it, I also made BAT setting flush the segment cache,
    so we're absolutely sure there's no MMU state left when writing
    BATs.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf