01 Aug, 2010

23 commits

  • This patch converts unnecessary divide and modulo operations
    in the KVM large page related code into logical operations.
    This allows gfn_t to be converted to u64 without breaking
    32-bit builds.
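
    As a rough sketch of the kind of change involved (the constant and
    helpers below are hypothetical, not the actual KVM code): with a
    power-of-two number of base pages per huge page, a 64-bit
    divide/modulo pair becomes a shift and mask, which needs no libgcc
    helper (__udivdi3 and friends) on 32-bit hosts:

        #include <linux/types.h>

        #define HPAGE_GFN_SHIFT 9  /* e.g. 2MB huge page = 512 4KB pages */

        static inline u64 hpage_index(u64 gfn)
        {
                return gfn >> HPAGE_GFN_SHIFT;                /* was: gfn / 512 */
        }

        static inline u64 hpage_offset(u64 gfn)
        {
                return gfn & ((1ULL << HPAGE_GFN_SHIFT) - 1); /* was: gfn % 512 */
        }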

    Signed-off-by: Joerg Roedel
    Signed-off-by: Marcelo Tosatti

    Joerg Roedel
     
  • This patch fixes the following warning.

    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    include/linux/kvm_host.h:259 invoked rcu_dereference_check() without
    protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 0
    no locks held by qemu-system-x86/29679.

    stack backtrace:
    Pid: 29679, comm: qemu-system-x86 Not tainted 2.6.35-rc3+ #200
    Call Trace:
    [] lockdep_rcu_dereference+0xa8/0xb1
    [] kvm_iommu_unmap_memslots+0xc9/0xde [kvm]
    [] kvm_iommu_unmap_guest+0x40/0x4e [kvm]
    [] kvm_arch_destroy_vm+0x1a/0x186 [kvm]
    [] kvm_put_kvm+0x110/0x167 [kvm]
    [] kvm_vcpu_release+0x18/0x1c [kvm]
    [] fput+0x22a/0x3a0
    [] filp_close+0xb4/0xcd
    [] put_files_struct+0x1b7/0x36b
    [] ? put_files_struct+0x48/0x36b
    [] ? do_raw_spin_unlock+0x118/0x160
    [] exit_files+0x6d/0x75
    [] do_exit+0x47d/0xc60
    [] ? _raw_spin_unlock_irq+0x30/0x36
    [] do_group_exit+0xcf/0x134
    [] get_signal_to_deliver+0x732/0x81d
    [] ? cpu_clock+0x4e/0x60
    [] do_notify_resume+0x117/0xc43
    [] ? trace_hardirqs_on+0xd/0xf
    [] ? sys_rt_sigtimedwait+0x2b5/0x3bf
    [] ? trace_hardirqs_off_thunk+0x3a/0x3c
    [] ? sysret_signal+0x5/0x3d
    [] int_signal+0x12/0x17
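
    The warning fires because the memslots pointer is RCU-protected and
    was dereferenced without the SRCU read lock held. A minimal sketch
    of the shape of such a fix (assuming kvm->srcu guards memslots, as
    in KVM of this era; the function body is illustrative, not the
    exact patch):

        static int kvm_iommu_unmap_memslots(struct kvm *kvm)
        {
                int idx;

                /* Makes the rcu_dereference() in kvm_memslots() legitimate. */
                idx = srcu_read_lock(&kvm->srcu);
                /* ... walk kvm_memslots(kvm) and unmap each slot ... */
                srcu_read_unlock(&kvm->srcu, idx);

                return 0;
        }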

    Signed-off-by: Sheng Yang
    Signed-off-by: Marcelo Tosatti

    Sheng Yang
     
  • We just introduced generic functions to handle shadow pages on PPC.
    This patch makes the respective backends make use of them, getting
    rid of a lot of duplicate code along the way.

    Signed-off-by: Alexander Graf
    Signed-off-by: Marcelo Tosatti

    Alexander Graf
     
  • Currently the shadow paging code keeps an array of entries it knows
    about. Whenever the guest invalidates an entry, we loop through the
    whole array, trying to invalidate matching parts.

    While this is a really simple implementation, it is probably the
    most inefficient one possible. So instead, let's keep an array of
    lists around that are indexed by a hash. This way each PTE can be
    added with 4 list_add() calls and removed with 4 list_del()
    invocations, and the search only needs to loop through entries that
    share the same hash.

    This patch implements said lookup and exports generic functions
    that both the 32-bit and 64-bit backends can use.
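
    A minimal sketch of the idea (the types, sizes, and hash below are
    hypothetical stand-ins, not the actual PPC backend): each shadow
    PTE entry sits on several hash-bucket lists, one per lookup key, so
    invalidation only walks the bucket that shares the hash:

        #include <linux/list.h>

        #define HPTE_HASH_BITS 4
        #define HPTE_HASH_NUM  (1 << HPTE_HASH_BITS)

        struct hpte_cache {
                struct list_head list_pte;   /* hashed by effective address */
                /* ... three more list_heads for the other lookup keys ... */
                u64 eaddr;
        };

        static struct list_head hpte_hash_pte[HPTE_HASH_NUM];

        static inline unsigned int hash_eaddr(u64 eaddr)
        {
                return (eaddr >> PAGE_SHIFT) & (HPTE_HASH_NUM - 1);
        }

        /* Adding an entry is one list_add() per hashed list. */
        static void hpte_cache_add(struct hpte_cache *pte)
        {
                list_add(&pte->list_pte, &hpte_hash_pte[hash_eaddr(pte->eaddr)]);
                /* ... list_add() the other three list_heads ... */
        }

        /* Invalidation walks only the matching bucket. */
        static void invalidate_eaddr(u64 eaddr)
        {
                struct hpte_cache *pte, *tmp;
                struct list_head *bucket = &hpte_hash_pte[hash_eaddr(eaddr)];

                list_for_each_entry_safe(pte, tmp, bucket, list_pte)
                        if (pte->eaddr == eaddr)
                                list_del(&pte->list_pte); /* plus the other list_del()s */
        }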

    Signed-off-by: Alexander Graf
    Signed-off-by: Marcelo Tosatti

    Alexander Graf
     
  • Clean up this function, since we already get the direct sp's access.

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • If the mapping is writable but the dirty flag is not set, we find
    the read-only direct sp and set up the mapping through it. Then,
    when a write #PF occurs, we mark this mapping writable inside the
    read-only direct sp, and from then on other genuinely read-only
    mappings can happily write through it without a #PF.

    This may break the guest's COW.

    Fix this by re-installing the mapping when the write #PF occurs.
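
    A very loose sketch of the fix's shape (all helpers here are
    hypothetical, for illustration only): on a write #PF that hits a
    mapping installed through a read-only direct sp, drop the spte and
    rebuild the mapping through a writable direct sp, instead of
    flipping the write bit inside the shared read-only sp:

        if (write_fault && sp->role.direct && !sp_is_writable(sp)) {
                drop_spte(vcpu->kvm, sptep);    /* hypothetical helper */
                /* fall through: re-map via a direct sp with write access */
        }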

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • In non-direct mapping, we mark the sp as 'direct' when we map the
    guest's large page, but its access is encoded entirely from the
    upper page-structure entries and does not include the last mapping
    level, which causes an access conflict.

    For example, consider this mapping:

                  [W]
                / PDE1 -> |---|
          P [W]           |   |  LPA
                \ PDE2 -> |---|
                  [R]

    P has two children, PDE1 and PDE2, and both PDE1 and PDE2 map the
    same large page (LPA). P's access is WR, PDE1's access is WR, and
    PDE2's access is RO (considering only read/write permissions here).

    When the guest accesses PDE1, we create a direct sp for LPA; the
    sp's access comes from P, i.e. WR, so we mark the ptes in this sp
    as writable.

    Then, when the guest accesses PDE2, we find LPA's shadow page, the
    same one as PDE1's, and mark its ptes as RO.

    So, when the guest accesses PDE1 again, an incorrect #PF occurs.

    Fix this by encoding the last mapping level's access into the
    direct shadow page.
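
    A minimal sketch (the access names are hypothetical;
    kvm_mmu_get_page() is shown with roughly its signature from this
    era): fold the last level's permissions into the access used to
    look up the direct sp, so PDE1 (WR) and PDE2 (RO) resolve to
    different shadow pages for LPA:

        unsigned access = upper_access & last_level_access; /* e.g. WR & RO = RO */

        sp = kvm_mmu_get_page(vcpu, gfn, addr, level, 1 /* direct */,
                              access, sptep);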

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • When we sync many unsync sps at one time (in mmu_sync_children()),
    we may map an spte writable. This is dangerous if one unsync sp's
    mapped gfn is another unsync page's gfn.

    For example:

    SP1.pte[0] = P
    SP2.gfn's pfn = P
    [SP1.pte[0] = SP2.gfn's pfn]

    First, we write-protect SP1 and SP2, but SP1 and SP2 are still
    unsync sps.

    Then we sync SP1 first. It detects that SP1.pte[0].gfn has only one
    unsync sp, namely SP2, so it maps it writable. But we plan to sync
    SP2 soon, and at this point SP2->unsync is no longer reliable: we
    later sync SP2, but SP2->gfn is already writable.

    So the final result is: SP2 is a synced page, but SP2.gfn is
    writable.

    This bug can corrupt the guest's page tables. Fix it by marking the
    mapping read-only if the mapped gfn has shadow pages.
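
    A rough sketch of the rule being added (the helper is
    hypothetical): when installing an spte during sync, grant write
    access only if the target gfn is not itself backed by shadow pages,
    whether sync or unsync; a later write #PF can always upgrade the
    mapping:

        if (writable && !gfn_has_shadow_pages(vcpu->kvm, gfn)) /* hypothetical */
                spte |= PT_WRITABLE_MASK;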

    Signed-off-by: Xiao Guangrong
    Signed-off-by: Marcelo Tosatti

    Xiao Guangrong
     
  • Some guest device drivers may leverage "Non-Snoop" I/O and
    explicitly execute WBINVD or CLFLUSH on a RAM region. Since
    migration may occur before the WBINVD or CLFLUSH, we need to
    maintain data consistency either by:
    1: flushing the cache (wbinvd) when the guest is scheduled out, if
    there is no wbinvd exit, or
    2: executing wbinvd on all dirty physical CPUs when the guest's
    wbinvd exits (see the sketch below).
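
    A minimal sketch of strategy 2 (vcpu->arch.wbinvd_dirty_mask is
    assumed here to track the physical CPUs the vcpu has run on since
    the last flush; treat the details as illustrative):

        static void wbinvd_ipi(void *unused)
        {
                wbinvd();
        }

        static int handle_wbinvd(struct kvm_vcpu *vcpu)
        {
                /* Flush caches on every dirty physical CPU, then reset. */
                smp_call_function_many(vcpu->arch.wbinvd_dirty_mask,
                                       wbinvd_ipi, NULL, 1);
                cpumask_clear(vcpu->arch.wbinvd_dirty_mask);
                return 1;
        }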

    Signed-off-by: Yaozu (Eddie) Dong
    Signed-off-by: Sheng Yang
    Signed-off-by: Marcelo Tosatti

    Sheng Yang
     
  • Signed-off-by: Avi Kivity
    Signed-off-by: Marcelo Tosatti

    Avi Kivity
     
  • No need to reload the mmu in between two different vcpu->requests checks.

    kvm_mmu_reload() may trigger KVM_REQ_TRIPLE_FAULT, but that will be caught
    during atomic guest entry later.

    Signed-off-by: Avi Kivity
    Signed-off-by: Marcelo Tosatti

    Avi Kivity
     
  • Older versions of 32-bit linux have a "Checking 'hlt' instruction"
    test where they repeatedly call the 'hlt' instruction, and then
    expect a timer interrupt to kick the CPU out of halt. This happens
    before any LAPIC or IOAPIC setup, which means that all of the
    APICs are in virtual wire mode at this point. Unfortunately,
    the current implementation of virtual wire mode is hardcoded to
    only kick the BSP, so if a crash+kexec occurs on a different
    vcpu, it will never get kicked.

    This patch makes pic_unlock() do the equivalent of
    kvm_irq_delivery_to_apic() for the IOAPIC code. That is, it runs
    through all of the vcpus looking for one that is in virtual wire
    mode. In the normal case where LAPICs and IOAPICs are configured,
    this won't be used at all. In the bootstrap phase of a modern
    OS, before the LAPICs and IOAPICs are configured, this will have
    exactly the same behavior as today; VCPU0 is always looked at
    first, so it will always get out of the loop after the first
    iteration. This will only go through the loop more than once
    during a kexec/kdump, in which case it will only do it a few times
    until the kexec'ed kernel programs the LAPIC and IOAPIC.
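
    A sketch of that loop (kvm_for_each_vcpu(), kvm_apic_accept_pic_intr()
    and kvm_vcpu_kick() are KVM helpers of this era; the surrounding
    wakeup bookkeeping is elided):

        struct kvm_vcpu *vcpu;
        int i;

        kvm_for_each_vcpu(i, vcpu, s->kvm) {
                if (kvm_apic_accept_pic_intr(vcpu)) {
                        kvm_vcpu_kick(vcpu);
                        break;  /* normally VCPU0, found on the first pass */
                }
        }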

    Signed-off-by: Chris Lalancette
    Signed-off-by: Avi Kivity

    Chris Lalancette
     
  • kvm_ia64_sync_dirty_log() is a helper function for kvm_vm_ioctl_get_dirty_log()
    which copies ia64's arch specific dirty bitmap to general one in memslot.
    So doing sanity checks in this function is unnatural. We move these
    checks outside of it and change the prototype appropriately.

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Avi Kivity

    Takuya Yoshikawa
     
  • kvm_get_dirty_log() calls copy_to_user(). So we need to narrow the
    dirty_log_lock spin_lock section so that it does not include this
    call.
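
    The usual shape of such a fix (a sketch; the buffer names are
    hypothetical): snapshot and clear the bitmap under the lock, then
    call copy_to_user(), which may fault and sleep, only after
    unlocking:

        spin_lock(&kvm->arch.dirty_log_lock);
        memcpy(snapshot, dirty_bitmap, n);   /* take a snapshot under the lock */
        memset(dirty_bitmap, 0, n);
        spin_unlock(&kvm->arch.dirty_log_lock);

        if (copy_to_user(log->dirty_bitmap, snapshot, n))
                return -EFAULT;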

    Signed-off-by: Takuya Yoshikawa
    Signed-off-by: Avi Kivity

    Takuya Yoshikawa
     
  • When a guest sets its SR entry to invalid, we may still find a
    corresponding entry in a BAT. So we need to make sure we're not
    faulting on invalid SR entries, but instead just claim them to be
    BAT resolved.

    This resolves breakage experienced when using libogc-based guests.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • The linux kernel already provides a hash function. Let's reuse that
    instead of reinventing the wheel!
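
    For instance, <linux/hash.h> provides hash_64(), which multiplies
    by a golden-ratio prime and keeps the top bits; a hand-rolled hash
    can typically be replaced one-for-one (the shift and width below
    are illustrative):

        #include <linux/hash.h>

        #define HPTE_HASH_BITS 4   /* illustrative bucket-count width */

        static inline unsigned int hash_eaddr(u64 eaddr)
        {
                return hash_64(eaddr >> PAGE_SHIFT, HPTE_HASH_BITS);
        }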

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • Initially we had to search for pte entries to invalidate them. Since
    the logic has improved since then, we can just get rid of the search
    function.

    Signed-off-by: Alexander Graf
    Signed-off-by: Avi Kivity

    Alexander Graf
     
  • is_hwpoison_address() accesses the page table, so the caller must
    hold current->mm->mmap_sem in read mode. Fix its usage in kvm's
    hva_to_pfn() accordingly.

    Also add a comment to is_hwpoison_address() to remind other users.
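
    A minimal sketch of the caller-side rule (illustrative; the exact
    surrounding hva_to_pfn() logic is elided, and hwpoison_pfn is a
    hypothetical stand-in for the special poison pfn):

        int poisoned;

        down_read(&current->mm->mmap_sem);
        poisoned = is_hwpoison_address(addr);   /* walks the page table */
        up_read(&current->mm->mmap_sem);

        if (poisoned)
                return hwpoison_pfn;            /* hypothetical special pfn */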

    Reported-by: Avi Kivity
    Signed-off-by: Huang Ying
    Signed-off-by: Avi Kivity

    Huang Ying
     
  • Enable Intel(R) Advanced Vector Extensions (AVX) for the guest.

    The detection of the AVX feature includes testing the OSXSAVE bit.
    When the OSXSAVE bit is not set, an AVX instruction results in #UD
    even if AVX is supported, so we are safe to expose the AVX bits to
    the guest directly.
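
    For reference, a small sketch of the reasoning (bit positions per
    the Intel SDM; the helper is hypothetical): AVX is CPUID.01H:ECX[28]
    and OSXSAVE is CPUID.01H:ECX[27]. If the guest has not enabled
    CR4.OSXSAVE, the CPU reports OSXSAVE as 0 and VEX-encoded AVX
    instructions raise #UD regardless of the AVX bit:

        #define FEAT_OSXSAVE (1u << 27)   /* CPUID.01H:ECX */
        #define FEAT_AVX     (1u << 28)   /* CPUID.01H:ECX */

        static int guest_can_use_avx(u32 cpuid_1_ecx)
        {
                return (cpuid_1_ecx & FEAT_AVX) &&
                       (cpuid_1_ecx & FEAT_OSXSAVE);
        }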

    Signed-off-by: Sheng Yang
    Signed-off-by: Avi Kivity

    Sheng Yang
     
  • If a process with a memory slot is COWed, the page will change its address
    (despite having an elevated reference count). This breaks internal memory
    slots which have their physical addresses loaded into vmcs registers (see
    the APIC access memory slot).

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • May be used for distinguishing between internal and user slots, or for sorting
    slots in size order.

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Usually the vcpu->requests bitmap is sparse, so a test_and_clear_bit() for
    each request generates a large number of unneeded atomics if a bit is set.

    Replace with a separate test/clear sequence. This is safe since there is
    no clear_bit() outside the vcpu thread.
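
    The resulting pattern looks roughly like this (a sketch of the
    idea, not the exact patch): a cheap non-atomic test_bit() first,
    and the atomic clear_bit() only when the bit is actually set:

        static inline bool check_request(int req, struct kvm_vcpu *vcpu)
        {
                if (test_bit(req, &vcpu->requests)) {
                        clear_bit(req, &vcpu->requests);
                        return true;
                }
                return false;
        }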

    Signed-off-by: Avi Kivity

    Avi Kivity
     
  • Makes it a little more readable and hackable.

    Signed-off-by: Avi Kivity

    Avi Kivity